This project forked from arzwa/wgd

wgd v2: a suite of tools to uncover and date ancient polyploidy and whole-genome duplication

Home Page: https://wgdv2.readthedocs.io/en/latest/

License: GNU General Public License v3.0


wgd v2: a suite of tools for WGD inference and timing

Hengchi Chen, Arthur Zwaenepoel, Yves Van de Peer

Bioinformatics & Evolutionary Genomics Group, VIB-UGent Center for Plant Systems Biology

Introduction | Installation | Pipelines | Parameters | Usage | Illustration | Documentation | Citation

wgd v2 is a Python package, upgraded from the original wgd package, for the inference and timing of ancient whole-genome duplication (WGD) events. This documentation illustrates the principles and usage of wgd v2. Below we first introduce the scope and mechanism of wgd v2 and then give practical information on installation and usage. An exemplar workflow in the tutorial section shows how to seek evidence for a putative WGD event and date it properly with a freshly obtained genome assembly in hand. For more theoretical details, we recommend our paper and book chapter, which provide fuller descriptions and discussion. The key improved features of wgd v2 are demonstrated in our latest manuscript in Bioinformatics. If you use wgd v2 in your research, please cite us as suggested in the Citation section.

Introduction

Polyploidization, the evolutionary process by which the entire genome of an organism is duplicated, also known as whole-genome duplication (WGD), occurs recurrently across the tree of life. There are two modes of polyploidization: autopolyploidization and allopolyploidization. Autopolyploidization is the duplication of a single genome, resulting in two identical subgenomes at the time of origin. Allopolyploidization is normally achieved in two steps: first, hybridization between two different species gives rise to a transient homoploid hybrid; second, duplication of the homoploid genome gives rise to the allopolyploid. Because of instability and unbalanced tetrasomic inheritance, for instance nuclear-cytoplasmic incompatibility, the polyploid genome then undergoes a process called diploidization, also known as fractionation, during which a large portion of the gene duplicates is lost and only a fraction is retained. The traces of polyploidization can thus be unearthed from these retained gene duplicates. Three approaches based on gene duplicates, namely the KS method, the gene tree - species tree reconciliation method and the synteny method, are commonly used to detect evidence for WGDs. The gene tree - species tree reconciliation method is outside the scope of wgd v2; interested readers are referred to WHALE, the phylogenomic program developed by Arthur Zwaenepoel, and the associated paper for technical and theoretical details.

The KS method is built on a model of gene family evolution in which each gene family evolves via gene duplication and loss. Note that a gene family here is assumed to be the cluster of all genes descended from one ancestral gene in a single genome. Recovering the gene tree of such a gene family informs the timing, that is, the age, of gene duplication events. The age referred to here is not in real geological time but in units of evolutionary distance, i.e., the number of substitutions per site. When the evolutionary rate remains approximately constant, the evolutionary distance is expected to be proportional to real evolutionary time. The synonymous distance KS, the number of synonymous substitutions per synonymous site, is a suitable candidate: synonymous substitutions do not change the amino acid and are thus regarded as neutral, and according to the neutral theory they should accumulate at an approximately constant rate. Given a model of gene family evolution in which genes duplicate and are lost at fixed rates, one can derive that the probability density function of the KS age distribution of retained gene duplicates is quasi-exponential: most retained duplicates are recently born, with an age near 0, and the number of retained duplicates decays quasi-exponentially with increasing age. Therefore, a large-scale gene duplication event such as a WGD, with varying retention rates, leaves an age peak on top of this background distribution, produced by the burst of gene duplicates within a short time frame, and this peak can be revealed by mixture modeling. However, WGDs identified from paralogous KS age distributions can only be timed on the KS scale of that specific species, which is not comparable across a phylogeny. Only with orthologous KS age distributions, which shift the estimation from paralogues to orthologues and inform the relative timing of speciation events, can we determine the phylogenetic placement of WGDs after proper rate correction. wgd v2 helps users construct paralogous and orthologous KS age distributions and achieve both the identification and the phylogenetic placement of WGDs.
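
The shape described above, a decaying background with a WGD peak on top, can be illustrated with a small numerical sketch. It is not the model fitted by wgd v2: the exponential background, the lognormal WGD component and every parameter value (weight, rate, mu, sigma) are assumptions chosen only to make the peak visible.

```python
import math

def exponential_pdf(x, rate):
    """Density of an exponential distribution (background of small-scale duplicates)."""
    return rate * math.exp(-rate * x)

def lognormal_pdf(x, mu, sigma):
    """Density of a lognormal distribution (burst of WGD-derived duplicates)."""
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, weight=0.6, rate=1.5, mu=math.log(1.2), sigma=0.2):
    """Background plus one WGD component; all parameter values are illustrative."""
    return weight * exponential_pdf(x, rate) + (1.0 - weight) * lognormal_pdf(x, mu, sigma)

def interior_peaks(xs, ys):
    """Return the x positions that are strict local maxima of ys."""
    return [xs[i] for i in range(1, len(ys) - 1) if ys[i - 1] < ys[i] > ys[i + 1]]

# Evaluate the density on a Ks grid from 0.01 to 3.00.
grid = [i / 100 for i in range(1, 301)]
density = [mixture_pdf(x) for x in grid]
peaks = interior_peaks(grid, density)
print("Ks peaks on top of the exponential background:", peaks)
```

With these made-up parameters the only interior maximum sits slightly below the lognormal mode, because the decaying background pulls the summit towards younger ages.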

Once a WGD has been phylogenetically placed, its absolute age (in geological time) can also be inferred from the WGD-retained gene duplicates, although no easy or straightforward pipeline has existed for this task so far. In wgd v2, we developed a feasible integrated pipeline for the absolute dating of WGDs. The pipeline can be roughly divided into three main steps. 1) Construction of the anchor KS distribution and delineation of the credible KS range adopted for phylogenetic dating, using wgd dmd, wgd ksd, wgd syn and wgd peak. Note that we only consider genome assemblies here, because with a transcriptome assembly it is impossible to distinguish WGD-derived duplicates from duplicates derived from small-scale duplication; the latter arise over a continuous time frame rather than in a single short time frame and thus reflect the duration of that branch rather than the time at which the WGD occurred. 2) Formulation of a starting tree for the phylogenetic dating, composed of a few species and annotated with fossil calibration information. This step is essential to the result of absolute WGD dating, so we suggest that users take great care to ensure the correct tree topology and proper bounds for the fossil calibrations. 3) Construction of orthogroups consisting of collinear duplicates of the focal species and their reciprocal best hits (RBHs) against the other species in the starting tree, followed by phylogenetic dating with a molecular dating program such as mcmctree, via wgd dmd and wgd focus. We recommend the Bayesian molecular dating program mcmctree, which provides a variety of substitution and rate models. Nonetheless, we urge users to set the prior distributions of the parameters with caution and to ensure adequate sampling of all parameters.

Installation

The easiest way to install wgd v2 is from PyPI. Note that if you want the latest updates, we suggest installing from source, since releases on PyPI lag behind the source repository. To install a stable, well-tested version, we currently recommend version 2.0.38. We suggest installing numpy version 1.19.0 before wgd, because pip does not always resolve the requirements well and the installation order of the dependencies can lead to incompatibilities. Python 3.6.5 and 3.8.0 (and nearby versions) have been tested and are compatible. We strongly recommend creating a virtual environment whether you install from source, PyPI or bioconda, which can be done with the commands below.

git clone https://github.com/heche-psb/wgd
cd wgd
virtualenv -p=python3 ENV  # or: python3 -m venv ENV
source ENV/bin/activate
pip install numpy==1.19.0
pip install wgd==2.0.38

To install from source, use the following commands.

git clone https://github.com/heche-psb/wgd
cd wgd
virtualenv -p=python3 ENV  # or: python3 -m venv ENV
source ENV/bin/activate
pip install numpy==1.19.0
pip install -r requirements.txt
pip install .

If you run into a permission problem during installation, please try the following command.

pip install -e .

If multiple versions of wgd are installed on the system, please put the directory containing the version of interest first in the PATH environment variable, for example

export PATH="$HOME/.local/bin:$PATH"

Note that the numpy version matters (as it does for many other packages, of course), especially for the fastcluster package. In our tests, numpy 1.19.0 works fine with Python 3.6 and 3.8. If you encounter errors or warnings about numpy, consider pre-installing numpy 1.19.0 or a nearby version before installing wgd. wgd relies on external software: diamond and mcl for wgd dmd; paml v4.9j, mafft (muscle and prank if set) and fasttree (or iqtree if set) for wgd ksd; and optionally mrbayes for the phylogenetic inference function in wgd dmd and wgd focus (plus mafft, muscle and prank when the analysis requires sequence alignment). Further optional software, including paml v4.9j, r8s, beast, eggnog, diamond, interproscan, hmmer v3.1b2 and astral-pro, is used for molecular dating, gene family function annotation or phylogenetic inference in wgd focus (hmmer v3.1b2 is also used for the orthogroup assignment function in wgd dmd, and astral-pro for the --collinearcoalescence function in wgd dmd).
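
Before running a pipeline it can help to check which of the external programs are reachable on the PATH. The sketch below uses only the Python standard library. The executable names in REQUIRED_TOOLS are assumptions (for example, codeml is taken as the paml binary, and fasttree may be installed as FastTree on some systems), so adjust them to your installation.

```python
import shutil

# Executable names are assumptions; adjust them to how the tools are
# installed on your system (e.g. FastTree vs fasttree, iqtree vs iqtree2).
REQUIRED_TOOLS = ["diamond", "mcl", "codeml", "mafft", "fasttree"]

def find_tools(names):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}

def missing_tools(names):
    """Return the names that could not be found on PATH."""
    return [name for name, path in find_tools(names).items() if path is None]

for tool, location in find_tools(REQUIRED_TOOLS).items():
    print(f"{tool:10s} {location or 'NOT FOUND'}")
```

A tool reported as NOT FOUND only means that no executable of that exact name is on the PATH; it may still be installed under a different name.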

Pipelines

To help users get familiar with wgd v2 quickly, we provide some common pipelines for users with a fresh genome assembly in hand.

Pipeline 1

Simple construction of age distribution

wgd dmd Aquilegia_coerulea -o wgd_dmd
wgd ksd wgd_dmd/Aquilegia_coerulea.tsv Aquilegia_coerulea -o wgd_ksd
#what will be in the result directory
-wgd_dmd
--Aquilegia_coerulea.tsv
-wgd_ksd
--Aquilegia_coerulea.tsv.ks.tsv Aquilegia_coerulea.tsv.ks.svg/pdf

The resulting Aquilegia_coerulea.tsv is the whole-paranome family file, Aquilegia_coerulea.tsv.ks.tsv is the KS distribution file, and Aquilegia_coerulea.tsv.ks.svg and Aquilegia_coerulea.tsv.ks.pdf are the KS plots.
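
For a quick look at the numbers behind the plots, the KS distribution file can be read with the standard library. This is a minimal sketch under assumptions: the file is taken to be tab-separated with a header row, and the KS column is assumed to be called dS (check the header of your own file). The example table is made up.

```python
import csv
import io

def read_ks_values(handle, column="dS", max_ks=5.0):
    """Collect Ks values in (0, max_ks] from a tab-separated table with a header."""
    values = []
    for row in csv.DictReader(handle, delimiter="\t"):
        try:
            ks = float(row[column])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows without a usable value
        if 0.0 < ks <= max_ks:  # NaN fails this comparison and is dropped too
            values.append(ks)
    return values

def histogram(values, bin_width=0.1, max_ks=5.0):
    """Count values per bin; bin i covers [i*bin_width, (i+1)*bin_width)."""
    n_bins = int(round(max_ks / bin_width))
    counts = [0] * n_bins
    for v in values:
        counts[min(int(v / bin_width), n_bins - 1)] += 1
    return counts

# Tiny made-up table standing in for a real *.ks.tsv file.
example = io.StringIO(
    "pair\tdS\tdN\n"
    "g1__g2\t0.05\t0.01\n"
    "g3__g4\t1.23\t0.20\n"
    "g5__g6\tnan\t0.10\n"
    "g7__g8\t7.50\t0.90\n"
)
ks_values = read_ks_values(example)
print(ks_values)
print(histogram(ks_values, bin_width=1.0, max_ks=5.0))
```

To use it on real output, pass an open file handle such as open("wgd_ksd/Aquilegia_coerulea.tsv.ks.tsv") instead of the in-memory example.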

Pipeline 2

Simple construction of age distribution with collinearity

wgd dmd Aquilegia_coerulea -o wgd_dmd
wgd ksd wgd_dmd/Aquilegia_coerulea.tsv Aquilegia_coerulea -o wgd_ksd
wgd syn -f mRNA -a Name wgd_dmd/Aquilegia_coerulea.tsv Aquilegia_coerulea.gff3 -ks wgd_ksd/Aquilegia_coerulea.tsv.ks.tsv -o wgd_syn
# what will be in the wgd_syn result directory
-wgd_syn
--anchors.csv
--families.tsv
--gene-table.csv
--iadhore.conf
--scaffold_length.tsv
--segments_coordinates.tsv
--Segprofile.csv
--Syndepth.pdf/svg
--Aquilegia_coerulea_Aquilegia_coerulea_multiplicons_level.pdf/png/svg
--Aquilegia_coerulea_gene_order_perchrom.tsv
--Aquilegia_coerulea.tsv.anchors.ks.tsv
--Aquilegia_coerulea.tsv.ksd.pdf/svg
--Aquilegia_coerulea-vs-Aquilegia_coerulea.dot.pdf/png/svg
--Aquilegia_coerulea-vs-Aquilegia_coerulea.dot_unit_gene.pdf/png/svg
--Aquilegia_coerulea-vs-Aquilegia_coerulea_Ks.dot.pdf/png/svg
--Aquilegia_coerulea-vs-Aquilegia_coerulea_Ks.dot_unit_gene.pdf/png/svg
--iadhore-out
---alignment.txt
---anchorpoints.txt
---baseclusters.txt
---genes.txt
---list_elements.txt
---multiplicon_pairs.txt
---multiplicons.txt
---segments.txt

The files anchors.csv, families.tsv, gene-table.csv, scaffold_length.tsv, segments_coordinates.tsv, Segprofile.csv and Aquilegia_coerulea_gene_order_perchrom.tsv are the basic files summarizing the gene order, families, anchors and collinear segments. Aquilegia_coerulea.tsv.anchors.ks.tsv is the KS distribution file of anchor pairs. Syndepth.pdf/svg shows the collinear ratio of multiplicons. Aquilegia_coerulea_Aquilegia_coerulea_multiplicons_level.pdf/png/svg is the "dupStack" plot showing multiplicons of different levels (the level being the number of segments within a multiplicon). Aquilegia_coerulea.tsv.ksd.pdf/svg is the KS plot with anchor pairs annotated. Aquilegia_coerulea-vs-Aquilegia_coerulea.dot.pdf/png/svg is the dot plot without KS annotation and with coordinates in bases. Aquilegia_coerulea-vs-Aquilegia_coerulea.dot_unit_gene.pdf/png/svg is the dot plot without KS annotation and with coordinates in genes. Aquilegia_coerulea-vs-Aquilegia_coerulea_Ks.dot.pdf/png/svg is the dot plot with KS annotation and with coordinates in bases. Aquilegia_coerulea-vs-Aquilegia_coerulea_Ks.dot_unit_gene.pdf/png/svg is the dot plot with KS annotation and with coordinates in genes. iadhore.conf is the configuration file for i-adhore. The iadhore-out subfolder contains the original collinearity results from i-adhore (please refer to the i-adhore manual for a detailed description).

Pipeline 3

Construction of age distribution and ELMM analysis

wgd dmd Aquilegia_coerulea -o wgd_dmd
wgd ksd wgd_dmd/Aquilegia_coerulea.tsv Aquilegia_coerulea -o wgd_ksd
wgd viz -d wgd_ksd/Aquilegia_coerulea.tsv.ks.tsv -o wgd_ELMM
# what will be in the wgd_ELMM result directory
-wgd_ELMM
--Aquilegia_coerulea.tsv.ks.tsv.ksd.pdf/svg
--Aquilegia_coerulea.tsv.ks.tsv.spline_node_averaged.pdf/svg
--Aquilegia_coerulea.tsv.ks.tsv.spline_weighted.pdf/svg
--Aquilegia_coerulea.tsv.ks.tsv_peak_detection_node_averaged.pdf/svg
--Aquilegia_coerulea.tsv.ks.tsv_peak_detection_weighted.pdf/svg
--elmm_Aquilegia_coerulea.tsv.ks.tsv_best_models_node_averaged.pdf/svg
--elmm_Aquilegia_coerulea.tsv.ks.tsv_best_models_weighted.pdf/svg
--elmm_Aquilegia_coerulea.tsv.ks.tsv_models_data_driven_node_averaged.pdf/svg
--elmm_Aquilegia_coerulea.tsv.ks.tsv_models_data_driven_weighted.pdf/svg
--elmm_Aquilegia_coerulea.tsv.ks.tsv_models_random_node_averaged.pdf/svg
--elmm_Aquilegia_coerulea.tsv.ks.tsv_models_random_weighted.pdf/svg
--elmm_BIC_Aquilegia_coerulea.tsv.ks.tsv_node_averaged.pdf/svg
--elmm_BIC_Aquilegia_coerulea.tsv.ks.tsv_weighted.pdf/svg

The Aquilegia_coerulea.tsv.ks.tsv.ksd.pdf/svg is the basic KS plot. The Aquilegia_coerulea.tsv.ks.tsv.spline_node_averaged.pdf/svg and Aquilegia_coerulea.tsv.ks.tsv.spline_weighted.pdf/svg are the spline plots used for peak detection. The Aquilegia_coerulea.tsv.ks.tsv_peak_detection_node_averaged.pdf/svg and Aquilegia_coerulea.tsv.ks.tsv_peak_detection_weighted.pdf/svg are the results of peak detection. The elmm_Aquilegia_coerulea.tsv.ks.tsv_best_models_node_averaged.pdf/svg and elmm_Aquilegia_coerulea.tsv.ks.tsv_best_models_weighted.pdf/svg are the results of the best model as selected by BIC. The elmm_Aquilegia_coerulea.tsv.ks.tsv_models_data_driven_node_averaged.pdf/svg and elmm_Aquilegia_coerulea.tsv.ks.tsv_models_data_driven_weighted.pdf/svg are the model results from data-driven initialization. The elmm_Aquilegia_coerulea.tsv.ks.tsv_models_random_node_averaged.pdf/svg and elmm_Aquilegia_coerulea.tsv.ks.tsv_models_random_weighted.pdf/svg are the model results from random initialization. The elmm_BIC_Aquilegia_coerulea.tsv.ks.tsv_node_averaged.pdf/svg and elmm_BIC_Aquilegia_coerulea.tsv.ks.tsv_weighted.pdf/svg are the BIC plots for each model.
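
As a reminder of how BIC-based model selection works in general, the sketch below computes the textbook criterion, k ln(n) - 2 ln(L), and picks the candidate with the lowest value. The log-likelihoods, parameter counts and sample size are made up for illustration and do not come from wgd viz, whose internal bookkeeping may differ.

```python
import math

def bic(log_likelihood, n_parameters, n_observations):
    """Bayesian information criterion: k * ln(n) - 2 * ln(L); lower is better."""
    return n_parameters * math.log(n_observations) - 2.0 * log_likelihood

def best_by_bic(fits, n_observations):
    """Pick the fit with the lowest BIC from {label: (log_likelihood, n_parameters)}."""
    scores = {label: bic(ll, k, n_observations) for label, (ll, k) in fits.items()}
    return min(scores, key=scores.get), scores

# Made-up log-likelihoods for mixtures with 1 to 3 components.
candidate_fits = {
    "1 component": (-1250.0, 2),
    "2 components": (-1180.0, 5),
    "3 components": (-1178.0, 8),
}
best, scores = best_by_bic(candidate_fits, n_observations=1000)
for label, score in scores.items():
    print(f"{label}: BIC = {score:.1f}")
print("selected:", best)
```

In this example the third component improves the likelihood only marginally, so the penalty for its extra parameters makes the two-component model the preferred one.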

Pipeline 4

Construction of age distribution with collinearity and peak finding analysis

wgd dmd Aquilegia_coerulea -o wgd_dmd
wgd ksd wgd_dmd/Aquilegia_coerulea.tsv Aquilegia_coerulea -o wgd_ksd
wgd syn -f mRNA -a Name wgd_dmd/Aquilegia_coerulea.tsv Aquilegia_coerulea.gff3 -ks wgd_ksd/Aquilegia_coerulea.tsv.ks.tsv -o wgd_syn
wgd peak --heuristic wgd_ksd/Aquilegia_coerulea.tsv.ks.tsv -ap wgd_syn/iadhore-out/anchorpoints.txt -sm wgd_syn/iadhore-out/segments.txt -le wgd_syn/iadhore-out/list_elements.txt -mp wgd_syn/iadhore-out/multiplicon_pairs.txt -n 1 4 -kc 3 -o wgd_peak
# what will be in the wgd_peak result directory
-wgd_peak
--AnchorKs_FindPeak
---AnchorKs_PeakCI_Aquilegia_coerulea.tsv.ks.tsv_node_averaged/weighted.pdf
---Aquilegia_coerulea.tsv.ks.tsv_95%CI_AP_for_dating_weighted_format.tsv
---Aquilegia_coerulea.tsv.ks.tsv_95%CI_AP_for_dating_weighted.tsv
---Aquilegia_coerulea.tsv.ks.tsv_95%CI_AP_for_dating_node_averaged_format.tsv
---Aquilegia_coerulea.tsv.ks.tsv_95%CI_AP_for_dating_node_averaged.tsv
--AnchorKs_GMM
---GMM_Elbow-Loss_original_Ks.pdf
---GMM_Original_AnchorKs_Clustering_Silhouette_Coefficient.pdf
---Original_AnchorKs_GMM_1/2/3/4components_prediction.tsv
---Original_AnchorKs_GMM_AIC_BIC.pdf
---Original_AnchorKs_GMM_Component1/2/3/4_node_averaged_Lognormal.pdf
---Original_AnchorKs_GMM_Component1/2/3/4_node_averaged.pdf
---LogGMM_CI
----GMM_1/2/3/4components_C0/1/2/3_95%CI.tsv
----GMM_Component1/2/3/4_node_averaged_Lognormal.pdf
---HighMass_CI
----GMM_1/2/3/4components_C0/1/2/3_HighMass_95%CI.tsv
----GMM_1/2/3/4components_HighMass_95%CI.pdf
--SegmentGuideKs_GMM
---GMM_Elbow-Loss_Segment_Ks.pdf
---GMM_Segment_Ks_Clustering_Silhouette_Coefficient.pdf
---Segment-guided_AnchorKs_GMM_1/2/3/4components_prediction.tsv
---Segment_Ks_Clusters_GMM_Component1/2/3/4.pdf
---Segment_Ks_Clusters_Lognormal_GMM_Component1/2/3/4.pdf
---Segment_Ks_GMM_AIC_BIC.pdf
---Segment_Ks.tsv
---HighMass_CI
----Segment_guided_1/2/3/4components_C0/1/2/3_HighMass_95%CI.tsv
----Segment_guided_1/2/3/4components_HighMass_95%CI.pdf
---HDR_CI
----Segment_guided_95%HDR_AP_1/2/3/4components_C0/1/2/3.tsv
----Segment_guided_AnchorKs_GMM_Component1/2/3/4_node_averaged_kde.pdf
----Segment_guided_AnchorKs_GMM_Component1/2/3/4_node_averaged.pdf
--SegmentKs_FindPeak
---SegmentKs_PeakCI_Aquilegia_coerulea.tsv.ks.tsv.pdf
---Peak_1/2_Segment_guided_Aquilegia_coerulea.tsv.ks.tsv_95%CI_MP_for_dating_format.tsv
---Peak_1/2_Segment_guided_Aquilegia_coerulea.tsv.ks.tsv_95%CI_MP_for_dating.tsv

Four result subfolders are produced: AnchorKs_FindPeak, AnchorKs_GMM, SegmentGuideKs_GMM and SegmentKs_FindPeak. The AnchorKs_FindPeak subfolder contains the peaks detected by the signal module of the SciPy library and the assumed highest-mass part (referred to as HighMass hereafter) of each peak, which can be used for further WGD dating. The AnchorKs_GMM subfolder contains the GMM results for the original anchor KS distribution, obtained with the mixture module of the scikit-learn library, and two subfolders: LogGMM_CI, with the 95% CI of each component, and HighMass_CI, with the HighMass of each component, which can be used for further WGD dating. The SegmentGuideKs_GMM subfolder presents the results of the segment KS GMM, mapped back to the anchor pairs residing on the segments, together with the 95% HDR and HighMass of each segment cluster in the subfolders HDR_CI and HighMass_CI. The SegmentKs_FindPeak subfolder is similar to AnchorKs_FindPeak but uses segment KS instead. The KS of multiplicons can also be calculated in place of segments using the option --guide, in which case the titles, labels, file names and folder names of the results change accordingly.
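
To give a feeling for what a 95% range around a KS component means, the sketch below computes the textbook central interval of a lognormal component and keeps the gene pairs whose KS falls inside it. How wgd peak actually defines its 95% CI, HDR and HighMass ranges is not spelled out here, so treat this as an assumption-laden illustration; the component parameters and pair names are invented.

```python
import math
from statistics import NormalDist

def lognormal_interval(mu, sigma, level=0.95):
    """Central interval of a lognormal whose logarithm has mean mu and sd sigma."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return math.exp(mu - z * sigma), math.exp(mu + z * sigma)

def pairs_in_range(ks_by_pair, lower, upper):
    """Keep the gene pairs whose Ks falls inside [lower, upper]."""
    return {pair: ks for pair, ks in ks_by_pair.items() if lower <= ks <= upper}

# Hypothetical component centred near Ks = 1.2 on the log scale.
lower, upper = lognormal_interval(mu=math.log(1.2), sigma=0.2)
anchor_ks = {"g1__g2": 0.35, "g3__g4": 1.10, "g5__g6": 1.45, "g7__g8": 2.60}
selected = pairs_in_range(anchor_ks, lower, upper)
print(f"95% range: {lower:.3f} to {upper:.3f}")
print("pairs kept for dating:", sorted(selected))
```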

Pipeline 5

Construction of age distribution with collinearity and rate correction

wgd dmd Aquilegia_coerulea -o wgd_dmd
wgd ksd wgd_dmd/Aquilegia_coerulea.tsv Aquilegia_coerulea -o wgd_ksd
wgd syn -f mRNA -a Name wgd_dmd/Aquilegia_coerulea.tsv Aquilegia_coerulea.gff3 -ks wgd_ksd/Aquilegia_coerulea.tsv.ks.tsv -o wgd_syn
wgd dmd --globalmrbh Aquilegia_coerulea Protea_cynaroides Acorus_americanus Vitis_vinifera -o wgd_globalmrbh
wgd ksd wgd_globalmrbh/global_MRBH.tsv Aquilegia_coerulea Protea_cynaroides Acorus_americanus Vitis_vinifera -o wgd_globalmrbh_ks
wgd viz -d wgd_globalmrbh_ks/global_MRBH.tsv.ks.tsv -fa Aquilegia_coerulea -epk wgd_ksd/Aquilegia_coerulea.tsv.ks.tsv -ap wgd_syn/iadhore-out/anchorpoints.txt -sp speciestree.nw -o wgd_viz_mixed_Ks --plotelmm --plotapgmm --reweight
# what will be in the wgd_viz_mixed_Ks result directory
-wgd_viz_mixed_Ks
--All_pairs.ks.node.weighted.pdf
--Focus_sister_pairs.ks.node.weighted.pdf
--global_MRBH.tsv.ks.tsv.ksd.pdf/svg
--Mixed.ks.Aquilegia_coerulea.node.weighted.pdf
--spair.corrected.ks.info.tsv
--spair.original.ks.info.tsv
--Simple_Ks_Distributions
---Acorus_americanus/Aquilegia_coerulea/Protea_cynaroides/Vitis_vinifera__Aquilegia_coerulea/Protea_cynaroides/Vitis_vinifera.ks.node.weighted.pdf

The All_pairs.ks.node.weighted.pdf is the KS plot of all species pairs. The Focus_sister_pairs.ks.node.weighted.pdf is the KS plot of all focal-sister species pairs. The global_MRBH.tsv.ks.tsv.ksd.pdf/svg is the KS plot of the data file global_MRBH.tsv.ks.tsv. The Mixed.ks.Aquilegia_coerulea.node.weighted.pdf is the final result of rate correction (with mixture modeling results if set). The spair.corrected.ks.info.tsv and spair.original.ks.info.tsv document the KS information of all species pairs after and before rate correction, respectively. The subfolder Simple_Ks_Distributions contains the individual KS plots of all species pairs.
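
The idea behind rate correction can be sketched with a common relative-rate formulation that uses a focal species, a sister species and an outgroup. Whether wgd viz performs exactly this arithmetic is an assumption; the function names and KS values below are purely illustrative.

```python
def branch_contributions(ks_focal_sister, ks_focal_out, ks_sister_out):
    """Split the focal-sister Ks into the parts accumulated on each branch."""
    k_focal = (ks_focal_sister + ks_focal_out - ks_sister_out) / 2.0
    k_sister = (ks_focal_sister + ks_sister_out - ks_focal_out) / 2.0
    return k_focal, k_sister

def corrected_divergence_ks(ks_focal_sister, ks_focal_out, ks_sister_out):
    """Rescale the divergence to the focal species' rate: twice its branch share."""
    k_focal, _ = branch_contributions(ks_focal_sister, ks_focal_out, ks_sister_out)
    return 2.0 * k_focal

# Made-up peak Ks values: the sister lineage evolves faster than the focal one.
k_focal, k_sister = branch_contributions(1.0, 1.5, 1.7)
corrected = corrected_divergence_ks(1.0, 1.5, 1.7)
print(f"focal branch {k_focal:.2f}, sister branch {k_sister:.2f}, corrected {corrected:.2f}")
```

Here the observed focal-sister KS of 1.0 is rescaled to 0.8 on the focal species' own KS scale, which is the scale on which its paralogous WGD peak is measured.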

Pipeline 6

Construction of age distribution with collinearity and WGD dating

wgd dmd Aquilegia_coerulea -o wgd_dmd
wgd ksd wgd_dmd/Aquilegia_coerulea.tsv Aquilegia_coerulea -o wgd_ksd
wgd syn -f mRNA -a Name wgd_dmd/Aquilegia_coerulea.tsv Aquilegia_coerulea.gff3 -ks wgd_ksd/Aquilegia_coerulea.tsv.ks.tsv -o wgd_syn
wgd peak --heuristic wgd_ksd/Aquilegia_coerulea.tsv.ks.tsv -ap wgd_syn/iadhore-out/anchorpoints.txt -sm wgd_syn/iadhore-out/segments.txt -le wgd_syn/iadhore-out/list_elements.txt -mp wgd_syn/iadhore-out/multiplicon_pairs.txt -n 1 4 -kc 3 -o wgd_peak
wgd dmd -f Aquilegia_coerulea -ap wgd_peak/AnchorKs_FindPeak/Aquilegia_coerulea.tsv.ks.tsv_95%CI_AP_for_dating_weighted_format.tsv -o wgd_dmd_ortho Potamogeton_acutifolius Spirodela_intermedia Amorphophallus_konjac Acanthochlamys_bracteata Dioscorea_alata Dioscorea_rotundata Acorus_americanus Acorus_tatarinowii Tetracentron_sinense Trochodendron_aralioides Buxus_austroyunnanensis Buxus_sinica Nelumbo_nucifera Telopea_speciosissima Protea_cynaroides Aquilegia_coerulea
wgd focus --protcocdating --aamodel lg wgd_dmd_ortho/merge_focus_ap.tsv -sp dating_tree.nw -o wgd_dating -d mcmctree -ds 'burnin = 2000' -ds 'sampfreq = 1000' -ds 'nsample = 20000' Potamogeton_acutifolius Spirodela_intermedia Amorphophallus_konjac Acanthochlamys_bracteata Dioscorea_alata Dioscorea_rotundata Acorus_americanus Acorus_tatarinowii Tetracentron_sinense Trochodendron_aralioides Buxus_austroyunnanensis Buxus_sinica Nelumbo_nucifera Telopea_speciosissima Protea_cynaroides Aquilegia_coerulea
# what will be in the wgd_dating result directory
-wgd_dating
--Concatenated.paln
--Concatenated.paln.paml
--G2S.Map
--GF00000001.paln
--..
--GF00000187.paln
--mcmctree
---Concatenated
----pep
-----Concatenated.paln.paml
-----dating_tree.nw
-----FigTree.tre
-----in.BV
-----lg.dat
-----lnf
-----mcmctree.ctrl
-----mcmctree.out
-----mcmc.txt
-----rates
-----rst
-----rst1
-----rub
-----tmp0001.ctl
-----tmp0001.out
-----tmp0001.trees
-----tmp0001.txt

The Concatenated.paln and Concatenated.paln.paml are the concatenated protein alignments in fasta and paml format. The G2S.Map is the map between gene and species names. The files GF00000001.paln to GF00000187.paln are the protein alignments of the individual gene families. The mcmctree subfolder contains the dating results for the concatenated family (and per gene family if set). The Concatenated subfolder within it contains the dating results of the concatenated protein alignment (or nucleotide alignment if set). The innermost pep subfolder contains the final dating results (of the concatenated protein alignment in this case). Please refer to the mcmctree manual for a detailed description of each file produced by mcmctree. The important result files are FigTree.tre, mcmctree.out and mcmc.txt, which document the final date estimates, the log information and the posterior samples for each node, respectively.
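
The posterior samples can be summarized independently of mcmctree's own report. The sketch below assumes that mcmc.txt is a tab-separated table with a header row and one column per node age; the column name t_n17 is hypothetical and the table is synthetic. It reports an equal-tailed interval, which is simpler than (and can differ from) a highest posterior density interval.

```python
import csv
import io

def read_posterior(handle):
    """Read a tab-separated sample table into {column: [floats]}."""
    reader = csv.DictReader(handle, delimiter="\t")
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name in columns:
            columns[name].append(float(row[name]))
    return columns

def summarize(samples, level=0.95):
    """Posterior mean and equal-tailed interval from a list of samples."""
    ordered = sorted(samples)
    cut = int((1.0 - level) / 2.0 * (len(ordered) - 1))
    mean = sum(ordered) / len(ordered)
    return mean, ordered[cut], ordered[len(ordered) - 1 - cut]

# Synthetic stand-in for mcmc.txt; "t_n17" is a hypothetical node label.
rows = ["Gen\tt_n17\tlnL"] + [f"{i}\t{i / 100:.2f}\t-1000.0" for i in range(101)]
columns = read_posterior(io.StringIO("\n".join(rows)))
mean, low, high = summarize(columns["t_n17"])
print(f"t_n17: mean {mean:.3f}, 95% interval {low:.2f} to {high:.2f}")
```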

Parameters

There are 7 main programs in wgd v2: dmd, focus, ksd, mix, peak, syn and viz. Below we describe each program and its associated parameters in detail. Please refer to the Usage section for the scenarios to which each parameter applies.

The program wgd dmd delineates the whole paranome, RBHs (Reciprocal Best Hits), MRBHs (Multiple Reciprocal Best Hits) and orthogroups, and provides some other orthogroup-related functions, including the circumscription of nested single-copy orthogroups (NSOGs), the unbiased test of single-copy orthogroups (SOGs) over missing inparalogs, the construction of BUSCO-guided single-copy orthogroups (SOGs), and the collinear coalescence inference of phylogeny.

wgd dmd sequences (option)
--------------------------------------------------------------------------------
-o, --outdir, the output directory, default wgd_dmd
-t, --tmpdir, the temporary working directory, default None, if None was given, the tmpdir would be assigned a random name in the current directory and automatically removed at the completion of the program, else the tmpdir would be kept
-p, --prot, flag option, whether using protein or nucleotide sequences
-c, --cscore, the c-score to restrict the homolog similarity of MRBHs, default None, if None was given, the c-score function wouldn't be activated, else expecting a decimal within the range of 0 and 1
-I, --inflation, the inflation factor for MCL program, default 2.0, with higher value leading to more but smaller clusters
-e, --eval, the e-value cut-off for similarity in diamond and/or hmmer, default 1e-10
--to_stop, flag option, whether to terminate translation at STOP codons, if the flag was set, translation would be terminated at the first in-frame stop codon, else a full translation continuing past any stop codons would be initiated
--cds, flag option, whether to only translate the complete CDS that starts with a valid start codon, only contains a single in-frame stop codon at the end and has a length divisible by three, if the flag was set, only the complete CDS would be translated
-f, --focus, the species to be merged on local MRBHs, default None, if None was given, the local MRBHs wouldn't be inferred
-ap, --anchorpoints, the anchor points data file from i-adhore for constructing the orthogroups with anchor pairs, default None
-sm, --segments, the segments datafile used in collinear coalescence analysis if initiated, default None
-le, --listelements, the list elements data file used in collinear coalescence analysis if initiated, default None
-gt, --genetable, the gene table datafile used in collinear coalescence analysis if initiated, default None
-coc, --collinearcoalescence, flag option, whether to initiate the collinear coalescence analysis, if the flag was set, the analysis would be initiated
-kf, --keepfasta, flag option, whether to output the sequence information of MRBHs, if the flag was set, the sequences of MRBHs would be in output
-kd, --keepduplicates, flag option, whether to allow the same gene to occur in different MRBHs (only meaningful when the cscore was used), if the flag was set, the same gene could be assigned to different MRBHs
-gm, --globalmrbh, flag option, whether to initiate global MRBHs construction, if the flag was set, the --focus option would be ignored and only global MRBHs would be built
-n, --nthreads, the number of threads to use, default 4
-oi, --orthoinfer, flag option, whether to initiate orthogroup inference, if the flag was set, the orthogroup inference program would be initiated
-oo, --onlyortho, flag option, whether to only conduct orthogroup inference, if the flag was set, only the orthogroup inference pipeline would be performed while the other analyses wouldn't be initiated
-gn, --getnsog, flag option, whether to initiate the search for nested single-copy gene families (NSOGs) (only meaningful when the orthogroup inference pipeline was activated), if the flag was set, an additional NSOGs analysis would be performed besides the basic orthogroup inference
-tree, --tree_method, which gene tree inference program to invoke (only meaningful when the collinear coalescence, gene-to-family assignment or NSOGs analysis were activated), default fasttree
-ts, --treeset, the parameters setting for gene tree inference, default None, this option can be provided multiple times
-mc, --msogcut, the ratio cutoff for mostly single-copy family (meaningful when activating the orthogroup inference pipeline) and species representation in collinear coalescence analysis, default 0.8
-ga, --geneassign, flag option, whether to initiate the gene-to-family assignment analysis, if the flag was set, the analysis would be initiated
-sa, --seq2assign, the queried sequences data file in gene-to-family assignment analysis, default None, this option can be provided multiple times
-fa, --fam2assign, the queried family data file in gene-to-family assignment analysis, default None
-cc, --concat, flag option, whether to initiate the concatenation pipeline for orthogroup inference, if the flag was set, the analysis would be initiated
-te, --testsog, flag option, whether to initiate the unbiased test of single-copy gene families, if the flag was set, the analysis would be initiated
-bs, --bins, the number of bins divided in the gene length normalization, default 100
-np, --normalizedpercent, the percentage of upper hits used for gene length normalization, default 5
-nn, --nonormalization, flag option, whether to call off the normalization, if the flag was set, no normalization would be conducted
-bsog, --buscosog, flag option, whether to initiate the busco-guided single-copy gene family analysis, if the flag was set, the analysis would be initiated
-bhmm, --buscohmm, the HMM profile datafile in the busco-guided single-copy gene family analysis, default None
-bctf, --buscocutoff, the HMM score cutoff datafile in the busco-guided single-copy gene family analysis, default None
-of, --ogformat, flag option, whether to add index to the RBH families
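
The reciprocal best hit (RBH) concept that underlies several of these options can be sketched in a few lines. This is a generic illustration of the definition, not the implementation in wgd dmd; the gene names and similarity scores are invented.

```python
def best_hits(scores):
    """For each query, keep the subject with the highest score."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)

# Made-up similarity scores between genes of species A and species B.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 150, ("a2", "b2"): 90, ("a3", "b1"): 120}
b_vs_a = {("b1", "a1"): 198, ("b1", "a3"): 110, ("b2", "a1"): 140, ("b2", "a2"): 95}
rbh_pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(rbh_pairs)
```

Only a1 and b1 choose each other, so a2 and a3 are left without an RBH partner even though each has a best hit in species B.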

The program wgd focus performs concatenation-based and coalescence-based phylogenetic inference, phylogenetic dating of WGDs, and related analyses.

wgd focus families sequences (option)
--------------------------------------------------------------------------------
-o, --outdir, the output directory, default wgd_focus
-t, --tmpdir, the temporary working directory, default None, if None was given, the tmpdir would be assigned a random name in the current directory and automatically removed at the completion of the program, else the tmpdir would be kept
-n, --nthreads, the number of threads to use, default 4
--to_stop, flag option, whether to terminate translation at STOP codons, if the flag was set, translation would be terminated at the first in-frame stop codon, else a full translation continuing past any stop codons would be initiated
--cds, flag option, whether to only translate the complete CDS that starts with a valid start codon, only contains a single in-frame stop codon at the end and has a length divisible by three, if the flag was set, only the complete CDS would be translated
--strip_gaps, flag option, whether to drop all gaps in multiple sequence alignment, if the flag was set, all gaps would be dropped
-a, --aligner, which alignment program to use, default mafft
-tree, --tree_method, which gene tree inference program to invoke, default fasttree
-ts, --treeset, the parameters setting for gene tree inference, default None, this option can be provided multiple times
--concatenation, flag option, whether to initiate the concatenation-based species tree inference, if the flag was set, the concatenation-based species tree would be inferred
--coalescence, flag option, whether to initiate the coalescence-based species tree inference, if the flag was set, the coalescence-based species tree would be inferred
-sp, --speciestree, species tree datafile for dating, default None
-d, --dating, which molecular dating program to use, default none
-ds, --datingset, the parameters setting for dating program, default None, this option can be provided multiple times
-ns, --nsites, the nsites information for r8s dating, default None
-ot, --outgroup, the outgroup information for r8s dating, default None
-pt, --partition, flag option, whether to initiate partition dating analysis for codon, if the flag was set, an additional partition dating analysis would be initiated
-am, --aamodel, which protein model to be used in mcmctree, default poisson
-ks, flag option, whether to initiate Ks calculation for homologues in the provided orthologous gene family
--annotation, which annotation program to use, default None
--pairwise, flag option, whether to initiate pairwise Ks estimation, if the flag was set, pairwise Ks values would be estimated
-ed, --eggnogdata, the eggnog annotation datafile, default None
--pfam, which option to use for pfam annotation, default None
--dmnb, the diamond database for annotation, default None
--hmm, the HMM profile for annotation, default None
--evalue, the e-value cut-off for annotation, default 1e-10
--exepath, the path to the interproscan executable, default None
-f, --fossil, the fossil calibration information in Beast, default ('clade1;clade2', 'taxa1,taxa2;taxa3,taxa4', '4;5', '0.5;0.6', '400;500')
-rh, --rootheight, the root height calibration info in Beast, default (4,0.5,400)
-cs, --chainset, the parameters of MCMC chain in Beast, default (10000,100)
--beastlgjar, the path to beastLG.jar, default None
--beagle, flag option, whether to use beagle in Beast, if the flag was set, beagle would be used
--protcocdating, flag option, whether to only initiate the protein-concatenation-based dating analysis, if the flag was set, the analysis would be initiated
--protdating, flag option, whether to only initiate the protein-based dating analysis, if the flag was set, the analysis would be initiated

The program wgd ksd constructs KS age distributions and performs rate correction.

wgd ksd families sequences (option)
--------------------------------------------------------------------------------
-o, --outdir, the output directory, default wgd_ksd
-t, --tmpdir, the temporary working directory, default None, if None was given, the tmpdir will be assigned a random name in the current directory and automatically removed at the completion of the program, else the tmpdir would be kept
-n, --nthreads, the number of threads to use, default 4
--to_stop, flag option, whether to translate through STOP codons, if the flag was set, translation will be terminated at the first in-frame stop codon, else a full translation continuing on past any stop codons would be initiated
--cds, flag option, whether to only translate the complete CDS that starts with a valid start codon, only contains a single in-frame stop codon at the end and has a length divisible by three, if the flag was set, only the complete CDS would be translated
--pairwise, flag option, whether to initiate pairwise Ks estimation, if the flag was set, pairwise Ks values would be estimated
--strip_gaps, flag option, whether to drop all gaps in multiple sequence alignment, if the flag was set, all gaps would be dropped
-a, --aligner, which alignment program to use, default mafft 
-tree, --tree_method, which gene tree inference program to invoke, default fasttree
--tree_options, options in tree inference as a comma separated string, default None
--node_average, flag option, whether to initiate node-average way of de-redundancy instead of node-weighted, if the flag was set, the node-averaging de-redundancy would be initiated
-sr, --spair, the species pair to be plotted, default None, this option can be provided multiple times
-sp, --speciestree, the species tree to perform rate correction, default None, if None was given, the rate correction analysis would be called off
-rw, --reweight, flag option, whether to recalculate the weight per species pair, if the flag was set, the weight would be recalculated
-or, --onlyrootout, flag option, whether to only conduct rate correction using the outgroup at root as outgroup, if the flag was set, only the outgroup at root would be used as outgroup
-epk, --extraparanomeks, extra paranome Ks data to plot in the mixed Ks distribution, default None
-ap, --anchorpoints, anchorpoints.txt file to plot anchor Ks in the mixed Ks distribution, default None
-pk, --plotkde, flag option, whether to plot kde curve of orthologous Ks distribution over histogram in the mixed Ks distribution, if the flag was set, the kde curve would be plotted
-pag, --plotapgmm, flag option, whether to perform and plot mixture modeling of anchor Ks in the mixed Ks distribution, if the flag was set, the mixture modeling of anchor Ks would be plotted
-pem, --plotelmm, flag option, whether to perform and plot elmm mixture modeling of paranome Ks in the mixed Ks distribution, if the flag was set, the elmm mixture modeling of paranome Ks would be plotted
-c, --components, the range of the number of components to fit in anchor Ks mixture modeling, default (1,4)
-xl, --xlim, the x axis limit of Ks distribution
-yl, --ylim, the y axis limit of Ks distribution
-ado, --adjustortho, flag option, whether to adjust the histogram height of orthologous Ks as to match the height of paralogous Ks, if the flag was set, the adjustment would be conducted
-adf, --adjustfactor, the adjustment factor of orthologous Ks, default 0.5
-oa, --okalpha, the opacity of orthologous Ks distribution in mixed plot, default 0.5
-fa, --focus2all, set focal species and let species pair to be between focal and all the remaining species, default None
-ks, --kstree, flag option, whether to infer Ks tree, if the flag was set, the Ks tree inference analysis would be initiated
-ock, --onlyconcatkstree, flag option, whether to only infer Ks tree under concatenated alignment, if the flag was set, only the Ks tree under concatenated alignment would be calculated
-cs, --classic, flag option, whether to draw mixed Ks plot in a classic manner where the full orthologous Ks distribution is drawn, if the flag was set, the classic mixed Ks plot would be drawn
-ta, --toparrow, flag option, whether to adjust the arrow to be at the top of the plot, instead of being coordinated as the KDE of the orthologous Ks distribution, if the flag was set, the arrow would be set at the top
-bs, --bootstrap, the number of bootstrap replicates of ortholog Ks distribution in mixed plot

The program wgd mix performs mixture model clustering analysis of KS age distributions.

wgd mix ks_datafile (option)
--------------------------------------------------------------------------------
-f, --filters, the cutoff alignment length, default 300
-r, --ks_range, the Ks range to be considered, default (0, 5)
-b, --bins, the number of bins in Ks distribution, default 50
-o, --outdir, the output directory, default wgd_mix
--method, which mixture model to use, default gmm
-n, --components, the range of the number of components to fit, default (1, 4)
-g, --gamma, the gamma parameter for bgmm models, default 0.001
-ni, --n_init, the number of k-means initializations, default 200
-mi, --max_iter, the maximum number of iterations, default 200

The program wgd peak searches for the credible KS range used in WGD dating.

wgd peak ks_datafile (option)
--------------------------------------------------------------------------------
-ap, --anchorpoints, the anchor points datafile, default None
-sm, --segments, the segments datafile, default None
-le, --listelements, the listsegments datafile, default None 
-mp, --multipliconpairs, the multipliconpairs datafile, default None
-o, --outdir, the output directory, default wgd_peak
-af, --alignfilter, cutoff for alignment identity, length and coverage, default 0.0, 0, 0.0
-r, --ksrange, range of Ks to be analyzed, default (0, 5)
-bw, --bin_width, bandwidth of Ks distribution, default 0.1
-ic, --weights_outliers_included, flag option, whether to include Ks outliers, if the flag was set, Ks outliers would be included in the analysis
-m, --method, which mixture model to use, default gmm
--seed, random seed given to initialization, default 2352890
-ei, --em_iter, the number of EM iterations to perform, default 200
-ni, --n_init, the number of k-means initializations, default 200
-n, --components, the range of the number of components to fit, default (1, 4)
-g, --gamma, the gamma parameter for bgmm models, default 1e-3
--boots, the number of bootstrap replicates of kde, default 200
--weighted, flag option, whether to use node-weighted method of de-redundancy, if the flag was set, the node-weighted method would be used
-p, --plot, the plotting method to be used, default identical
-bm, --bw_method, the bandwidth method to be used in analyzing the peak of WGD dates, default silverman
--n_medoids, the number of medoids to fit, default 2
-km, --kdemethod, the kde method to be used in analyzing the peak of WGD dates, kmedoids analysis or the basic Ks plotting, default scipy
--n_clusters, the number of clusters to plot Elbow loss function, default 5
-gd, --guide, the regime residing anchors, default Segment
-prct, --prominence_cutoff, the prominence cutoff of acceptable peaks in peak finding steps, default 0.1
-rh, --rel_height, the relative height at which the peak width is measured, default 0.4
-kd, --kstodate, the range of Ks to be dated in heuristic search, default (0.5, 1.5)
-xl, --xlim, the x axis limit of GMM Ks distribution
-yl, --ylim, the y axis limit of GMM Ks distribution
--manualset, flag option, whether to output anchor pairs with manually set Ks range, if the flag was set, manually set Ks range would be outputted
--ci, the confidence level of log-normal distribution to date, default 95
--hdr, the highest density region (HDR) applied in the segment-guided anchor pair Ks distribution, default 95
--heuristic, flag option, whether to initiate heuristic method of defining CI for dating, if the flag was set, the heuristic method would be initiated
-kc, --kscutoff, the Ks saturation cutoff in dating, default 5
--keeptmpfig, flag option, whether to keep temporary figures in peak finding process, if the flag was set, those figures would be kept

The program wgd syn performs intra- and inter-specific synteny inference.

wgd syn families gffs (option)
--------------------------------------------------------------------------------
-ks, --ks_distribution, ks distribution datafile, default None
-o, --outdir, the output directory, default wgd_syn
-f, --feature, the feature for parsing gene IDs from GFF files, default gene
-a, --attribute, the attribute for parsing the gene IDs from the GFF files, default ID
-atg, --additionalgffinfo, the feature and attribute information of additional gff3 files if different, in the format of (feature;attribute), default None
-ml, --minlen, the minimum length of a scaffold to be included in dotplot, default -1, if -1 was set, 10% of the length of the longest scaffold would be used
-ms, --maxsize, the maximum family size to be included, default 200
-r, --ks_range, the Ks range in colored dotplot, default (0, 5)
--pathiadhore, the path to the i-adhore executable, which can be simply ignored if i-adhore can already be properly called, default None
--iadhore_options, the parameter setting in iadhore, default as a string of length zero
-mg, --minseglen, the minimum length of segments to include in ratio if <= 1, default 10000
-kr, --keepredun, flag option, whether to keep redundant multiplicons, if the flag was set, the redundant multiplicons would be kept
-mgn, --mingenenum, the minimum number of genes for a segment to be considered, default 30
-ds, --dotsize, the dot size in dot plot, default 0.3
-aa, --apalpha, the opacity of anchor dots, default 1
-ha, --hoalpha, the opacity of homolog dots, default 0
-srt, --showrealtick, flag option, whether to show the real tick in genes or bases, if the flag was set, the real tick would be shown
-tls, --ticklabelsize, the label size of tick, default 5
-gr, --gistrb, flag option, whether to use gist_rainbow as color map of dotplot
-n, --nthreads, the number of threads to use in synteny inference, default 4
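
The -1 default of --minlen listed above can be read as a simple rule. The following is only a minimal Python sketch of that rule as worded in the option list; the function name resolve_minlen, the example scaffold lengths and the inclusive comparison are hypothetical, and the actual implementation in wgd may differ.

```python
def resolve_minlen(minlen, scaffold_lengths):
    """Return the effective minimum scaffold length for the dotplot.

    Assumption: -1 means "10% of the longest scaffold", as stated in the
    option description; any other value is used as given.
    """
    if minlen == -1:
        return max(scaffold_lengths) / 10
    return minlen


# Hypothetical scaffold lengths in bases.
lengths = [5_000_000, 1_200_000, 300_000]
threshold = resolve_minlen(-1, lengths)
# Keep scaffolds at least as long as the threshold (inclusive by assumption).
kept = [n for n in lengths if n >= threshold]
print(threshold, kept)
```

With these example lengths, the shortest scaffold falls below 10% of the longest one and is left out of the dotplot.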

The program wgd viz visualizes KS age distributions and synteny.

wgd viz (option)
--------------------------------------------------------------------------------
-d, --datafile, the Ks datafile, default None
-o, --outdir, the output directory, default wgd_viz
-sr, --spair, the species pair to be plotted, default None, this option can be provided multiple times
-fa, --focus2all, set focal species and let species pair to be between focal and all the remaining species, default None
-gs, --gsmap, the gene name-species name map, default None
-sp, --speciestree, the species tree to perform rate correction, default None, if None was given, the rate correction analysis would be called off
-pk, --plotkde, flag option, whether to plot kde curve upon histogram, if the flag was set, kde curve would be added
-rw, --reweight, flag option, whether to recalculate the weight per species pair, if the flag was set, the weight would be recalculated
-or, --onlyrootout, flag option, whether to only conduct rate correction using the outgroup at root as outgroup, if the flag was set, only the outgroup at root would be used as outgroup
-iter, --em_iterations, the maximum EM iterations, default 200
-init, --em_initializations, the maximum EM initializations, default 200
-prct, --prominence_cutoff, the prominence cutoff of acceptable peaks, default 0.1
-rh, --rel_height, the relative height at which the peak width is measured, default 0.4
-sm, --segments, the segments datafile, default None
-ml, --minlen, the minimum length of a scaffold to be included in dotplot, default -1, if -1 was set, 10% of the length of the longest scaffold would be used
-ms, --maxsize, the maximum family size to be included, default 200
-ap, --anchorpoints, the anchor points datafile, default None
-mt, --multiplicon, the multiplicons datafile, default None
-gt, --genetable, the gene table datafile, default None
-mg, --minseglen, the minimum length of segments to include, in ratio if <= 1, default 10000
-mgn, --mingenenum, the minimum number of genes for a segment to be considered, default 30
-kr, --keepredun, flag option, whether to keep redundant multiplicons, if the flag was set, the redundant multiplicons would be kept
-epk, --extraparanomeks, extra paranome Ks data to plot in the mixed Ks distribution, default None
-pag, --plotapgmm, flag option, whether to conduct and plot mixture modeling of anchor Ks in the mixed Ks distribution, if the flag was set, the mixture modeling of anchor Ks would be conducted and plotted
-pem, --plotelmm, flag option, whether to conduct and plot elmm mixture modeling of paranome Ks in the mixed Ks distribution, if the flag was set, the elmm mixture modeling of paranome Ks would be conducted and plotted
-c, --components, the range of the number of components to fit in anchor Ks mixture modeling, default (1,4)
-psy, --plotsyn, flag option, whether to initiate the synteny plot, only when the flag was set, the synteny plot would be produced
-ds, --dotsize, the dot size in dot plot, default 0.3
-aa, --apalpha, the opacity of anchor dots, default 1
-ha, --hoalpha, the opacity of homolog dots, default 0
-srt, --showrealtick, flag option, whether to show the real tick in genes or bases, if the flag was set, the real tick would be shown
-tls, --ticklabelsize, the label size of tick, default 5
-xl, --xlim, the x axis limit of Ks distribution
-yl, --ylim, the y axis limit of Ks distribution
-ado, --adjustortho, flag option, whether to adjust the histogram height of orthologous Ks as to match the height of paralogous Ks, if the flag was set, the adjustment would be conducted
-adf, --adjustfactor, the adjustment factor of orthologous Ks, default 0.5
-oa, --okalpha, the opacity of orthologous Ks distribution in mixed plot, default 0.5
-cs, --classic, flag option, whether to draw mixed Ks plot in a classic manner where the full orthologous Ks distribution is drawn, if the flag was set, the classic mixed Ks plot would be drawn
-ta, --toparrow, flag option, whether to adjust the arrow to be at the top of the plot, instead of being coordinated as the KDE of the orthologous Ks distribution, if the flag was set, the arrow would be set at the top
-na, --nodeaveraged, flag option, whether to use node-averaged method for de-redundancy, if the flag was set, the node-averaged method would be initiated
-bs, --bootstrap, the number of bootstrap replicates of ortholog Ks distribution in mixed plot
-gr, --gistrb, flag option, whether to use gist_rainbow as color map of dotplot
-n, --nthreads, the number of threads to use in bootstrap sampling, default 1

Usage

Here we provide the basic usage of each program, the relevant parameters, and suggestions on parameterization. Because the number of cores and threads can significantly affect the run time, wgd reports some information about the user's system to help set threads and memory efficiently. The number of logical CPUs reported is the number of physical cores multiplied by the number of threads that can run on each core (also known as Hyper-Threading). The number of logical CPUs is not necessarily the number of CPUs that the current process can actually use. The available memory is the memory that can be given instantly to processes without the system going into swap, and reflects the memory actually available. The free memory is the memory not being used at all (zeroed) that is readily available. These definitions follow the documentation of psutil.
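
To make the difference between logical CPUs and the CPUs usable by the current process concrete, here is a minimal sketch that uses only the Python standard library. It is not the reporting code of wgd; the function name cpu_report is hypothetical. psutil exposes comparable numbers (for instance psutil.cpu_count and psutil.virtual_memory), but it is left out here so that the sketch runs without extra dependencies.

```python
import os


def cpu_report():
    """Return (logical_cpus, usable_cpus) for the current process."""
    logical = os.cpu_count() or 1
    try:
        # Available on Linux: the set of CPUs this process may run on,
        # which can be smaller than the logical count (e.g. on a cluster).
        usable = len(os.sched_getaffinity(0))
    except AttributeError:
        # Platforms without CPU affinity support: fall back to the logical count.
        usable = logical
    return logical, usable


logical, usable = cpu_report()
print(f"logical CPUs: {logical}, usable by this process: {usable}")
```

When choosing a value for -n, the second number is usually the more relevant upper bound.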

wgd dmd

The delineation of whole paranome

wgd dmd Aquilegia_coerulea -I 2 -e 1e-10 -bs 100 -np 5 (-nn) (--to_stop) (--cds) (-n 4) (-o wgd_dmd) (-t working_tmp)

Note that we do not provide the coding sequence (cds) file Aquilegia_coerulea, but it can be easily downloaded from Phytozome (the same holds for the other Usage examples). As reported in the issues, some users downloaded the Acoerulea_322_v3.1.cds.fa.gz file instead of the Acoerulea_322_v3.1.cds_primaryTranscriptOnly.fa.gz file. For the construction of a whole paranome KS distribution, only one transcript (the primary one) per gene should be included, so that KS reflects the age of gene duplication events rather than alternative splicing. Transcriptome data should be carefully treated to remove redundancy, so as to reduce false positive duplication bumps caused by pervasive alternative splicing. The inflation factor, given by -I or --inflation, affects the granularity (resolution) of the clustering outcome and implicitly controls the number of clusters: low values such as 1.3 or 1.4 lead to fewer but larger clusters, and high values such as 5 or 6 lead to more but smaller clusters. We set the default value to 2, as suggested by MCL. The e-value cut-off for sequence similarity, given by -e or --eval, is the key parameter measuring the significance of a hit and is set to 1e-10 by default. The e-value (expected value) of a hit quantifies the number of alignments of similar or better quality that you expect to find when searching this query against a database of random sequences of the same size as the actual target database. Note that DIAMOND itself by default only reports alignments with e-value < 0.001. The percentage of upper hits used for gene length normalization, given by -np or --normalizedpercent, determines the upper percentile of hits per bin (categorized by gene length) used in the linear regression fit, considering that not all hits per bin show an apparent linear relationship. It is set to 5 by default, meaning that the top 5% of hits per bin are used. 
The number of bins used in gene length normalization, given by -bs or --bins, determines how many bins the gene lengths are categorized into and is set to 100 by default. The parameter -nn or --nonormalization can be set to call off the normalization process, although we suggest conducting the normalization to obtain a more accurate gene family clustering result. The parameters --to_stop and --cds control how coding sequences are translated into amino acid sequences. If --to_stop is set, translation is terminated at the first in-frame stop codon; otherwise translation continues past any stop codons. If --cds is set, sequences that do not start with a valid start codon, contain more than one in-frame stop codon, or have a length not divisible by 3 are dropped, so that only strictly complete coding sequences are included in the subsequent analysis. The exact behaviour of --to_stop and --cds is defined and described in the biopython library. The number of parallel threads, set by -n or --nthreads, can be increased to accelerate the calculation within diamond. The directories for output and intermediate files are set by -o or --outdir and -t or --tmpdir; they will be created by the program itself and will be overwritten if they already exist. Note that diamond must be pre-installed and available in the environment path for all analyses performed by wgd dmd except the collinear coalescence analysis.
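
The three conditions that --cds imposes on a sequence can be sketched in plain Python as follows. This is an illustration of the rule as described above, not the code used by wgd (which relies on biopython); the function name is_complete_cds is hypothetical, and the start codon set is an assumption taken from the standard genetic code table.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
# Assumption: start codons of the standard genetic code (NCBI table 1).
START_CODONS = {"ATG", "TTG", "CTG"}


def is_complete_cds(seq):
    """Check the three conditions described for --cds on one nucleotide sequence."""
    seq = seq.upper()
    # Length must be divisible by 3 and hold at least a start and a stop codon.
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    # Must begin with a valid start codon and end with a stop codon.
    if codons[0] not in START_CODONS or codons[-1] not in STOP_CODONS:
        return False
    # No additional in-frame stop codon before the terminal one.
    return not any(codon in STOP_CODONS for codon in codons[:-1])


print(is_complete_cds("ATGAAATAA"))     # complete CDS
print(is_complete_cds("ATGTAAAAATAA"))  # internal in-frame stop codon
print(is_complete_cds("ATGAAATA"))      # length not divisible by 3
```

Sequences failing any of the checks would be the ones dropped when --cds is set.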

We suggest that the default setting, in which the inflation factor is 2, the e-value cut-off is 1e-10 and the other parameters are left at their defaults, is a good starting point unless you specifically want to explore the effects of different parameters. The command for the delineation of the whole paranome is then simply as below.

wgd dmd Aquilegia_coerulea

The delineation of RBHs

wgd dmd sequence1 sequence2 -e 1e-10 -bs 100 -np 5 (-nn) (-n 4) (-c 0.9) (--ogformat) (--to_stop) (--cds) (-o wgd_dmd) (-t working_tmp)

To delineate RBHs between two cds sequence files, the relevant parameters are mostly the same as for whole paranome inference, except for the parameter -c or --cscore, which ranges between 0 and 1 and relaxes the similarity cutoff from exclusive reciprocal best hits to hits whose score reaches a certain ratio of the best hit. The bit score is a scoring-matrix-independent measure of the (local) similarity of two aligned sequences, with larger values indicating higher similarity. For instance, if gene b1 from genome B has gene a1 from genome A as its best hit with a bit score of 100, then with -c 0.9 the genes from genome A whose bit score with gene b1 is higher than 0.9x100 will also be written to the result file; these are of course no longer RBHs in the strict sense, but highly similar homologue pairs. If more than 2 sequence files are provided, RBHs are calculated for every pair of files, excluding the comparison of a file against itself. The number of parallel threads, set by -n or --nthreads to boost the running speed, is suggested to be (N-1)N/2, where N is the number of cds files, to achieve the highest efficiency. The option --ogformat can be set to add an index (for instance GF00000001) to the output RBH gene families, which can then be used in the KS calculation by wgd ksd.
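
The effect of -c on the hits of a single query gene can be sketched as below. This is a hedged illustration of the rule described above, not the implementation in wgd: the function name relaxed_hits and the example scores are hypothetical, and whether the boundary score itself is included is an assumption.

```python
def relaxed_hits(hits, cscore=1.0):
    """Return the subjects whose bit score reaches cscore times the best score.

    hits: dict mapping subject gene ID to the bit score against one query gene.
    Assumption: the boundary is inclusive (score >= cscore * best).
    """
    best = max(hits.values())
    return sorted(gene for gene, score in hits.items() if score >= cscore * best)


# Hypothetical bit scores of gene b1 (genome B) against genes of genome A.
hits_b1 = {"a1": 100, "a2": 95, "a3": 80}
print(relaxed_hits(hits_b1))              # only the best hit
print(relaxed_hits(hits_b1, cscore=0.9))  # best hit plus near-best hits
```

With -c 0.9 the gene a2 is reported next to a1, which is why the result is no longer a strict RBH set.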

The suggested command to start with also uses the default settings, as shown below.

wgd dmd sequence1 sequence2

The delineation of local MRBHs

wgd dmd sequence1 sequence2 sequence3 -f sequence1 -e 1e-10 -bs 100 -np 5 (-nn) (-n 4) (-kf) (-kd) (-c 0.9) (--to_stop) (--cds) (-o wgd_dmd) (-t working_tmp)

The distinction between local and global (see below) MRBHs is that local MRBHs result from merging RBHs on a shared focal species. For instance, in a three-species system (A,(B,C)), the local MRBHs of A only require the calculation of RBHs between A and B (denoted AB) and between A and C (AC), followed by the merging of AB and AC on A. The global MRBHs are independent of any focal species: RBHs are calculated for all possible species pairs (excluding self-to-self pairs), such that AB, AC and BC are all calculated and merged.

Both types of MRBHs explained above, local and global, can be delineated by wgd dmd. The local MRBHs are constructed by merging only the RBHs that involve the focal species, which is set by -f or --focus. The parameter -kf or --keepfasta can be set to retain the sequence information of each MRBH. The parameter -kd or --keepduplicates determines whether the same gene can appear in different local MRBHs. Normally there are no duplicates in the local MRBHs, but if -c is set to 0.9 (or any value smaller than 1), the same gene may appear multiple times in different local MRBHs. That is to say, the parameter -kd is only meaningful when set together with the parameter -c. The number of parallel threads is suggested to be the number of cds files minus 1.
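
The merging of RBH tables on a focal species can be pictured with the following minimal sketch. It is not the algorithm of wgd: the function name merge_local_mrbh, the gene IDs and the choice to keep only focal genes that have an RBH in every other species are assumptions made for illustration.

```python
def merge_local_mrbh(rbh_tables):
    """Merge RBH tables that all share the same focal species.

    rbh_tables: dict mapping a non-focal species to a dict
                {focal gene ID: RBH partner gene ID in that species}.
    Assumption: a family is kept only if the focal gene has an RBH
    partner in every non-focal species.
    """
    shared = set.intersection(*(set(table) for table in rbh_tables.values()))
    return {
        focal_gene: {species: table[focal_gene] for species, table in rbh_tables.items()}
        for focal_gene in sorted(shared)
    }


# Hypothetical RBHs of focal species A against species B and C (AB and AC).
rbh_tables = {
    "B": {"a1": "b1", "a2": "b2"},
    "C": {"a1": "c1", "a3": "c3"},
}
print(merge_local_mrbh(rbh_tables))
```

Only a1 has partners in both B and C here, so a single local MRBH family (a1, b1, c1) results; the BC comparison is never needed, in contrast to the global MRBHs.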

A suggested starting run uses the default parameters, with the command shown below.

wgd dmd sequence1 sequence2 sequence3 -f sequence1

The delineation of global MRBHs

wgd dmd sequence1 sequence2 sequence3 -gm -e 1e-10 -bs 100 -np 5 (-nn) (-n 4) (-kf) (-kd) (-c 0.9) (--to_stop) (--cds) (-o wgd_dmd) (-t working_tmp)

The global MRBHs are constructed by exhaustively merging all possible pair-wise RBHs (excluding the comparison of a sequence file against itself), which can be initiated by adding the flag -gm or --globalmrbh. The remaining relevant parameters are the same as for the local MRBHs. The number of parallel threads is again suggested to be (N-1)N/2, where N is the number of cds files, to achieve the highest efficiency.

A suggested starting run uses the default parameters, with the command shown below.

wgd dmd sequence1 sequence2 sequence3 -gm

The delineation of orthogroups

wgd dmd sequence1 sequence2 sequence3 -oi -oo -e 1e-10 -bs 100 -np 5 (-nn) (-cc) (-te) (-mc 0.8) (-gn) (-tree 'fasttree') (-ts '-fastest') (-n 4) (--to_stop) (--cds) (-o wgd_dmd) (-t working_tmp)

In wgd v2, we also implemented an algorithm for delineating orthogroups, which can be initiated with the parameter -oi or --orthoinfer. Two ways of delineation can be chosen: the concatenation way (set by the parameter -cc or --concat) or the non-concatenation (default) way. In brief, the concatenation way starts by concatenating all the sequences into a single sequence file and then infers the whole paranome of this single file, with the clustering results mapped back to the species they belong to. The non-concatenation way starts with pair-wise diamond searches (including querying each sequence file against itself), after which all the sequence similarity tables are concatenated and clustered into orthogroups. Some other possibly useful post-clustering functions can be initiated. The parameter -te or --testsog starts the unbiased test of single-copy gene families (note that this function needs hmmer (v3.1b2) to be installed in the environment path). The parameter -mc or --msogcut, ranging between 0 and 1, searches for the so-called mostly single-copy families, whose species coverage is higher than the given cut-off. The parameter -gn or --getnsog searches for nested single-copy gene families (NSOGs), which are originally multi-copy but have a (mostly) single-copy branch; this requires the tree-inference program chosen by -tree or --tree_method to be pre-installed in the environment path, with the parameters for gene tree inference controlled by -ts or --treeset. The program wgd dmd will still conduct the RBH calculation unless the parameter -oo or --onlyortho is set. If one only wants to infer the orthogroups, we suggest adding the flag -oo so that only the orthogroup delineation analysis is run. 
The number of parallel threads is suggested to be (N+1)N/2, where N is the number of cds files, to achieve the highest efficiency, since the self-comparisons are also included.
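
The suggested thread numbers (N-1)N/2 and (N+1)N/2 are simply the numbers of file pairs to be searched, without and with self-comparisons. The short sketch below enumerates these pairs; the function name search_pairs and the file names are hypothetical.

```python
from itertools import combinations, combinations_with_replacement


def search_pairs(files, include_self=False):
    """List the unordered pairs of sequence files to be compared."""
    if include_self:
        # Orthogroup inference: each file is also searched against itself.
        return list(combinations_with_replacement(files, 2))
    # RBHs and global MRBHs: only pairs of different files.
    return list(combinations(files, 2))


files = ["sequence1", "sequence2", "sequence3"]
print(len(search_pairs(files)))                     # (N-1)N/2 pairs
print(len(search_pairs(files, include_self=True)))  # (N+1)N/2 pairs
```

Setting -n to the number of pairs lets every pair-wise search run at the same time, provided the machine has that many CPUs available.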

The default setting of parameters is a reasonable starting point with the command as below.

wgd dmd sequence1 sequence2 sequence3 -oi -oo

The collinear coalescence inference of phylogeny

wgd dmd sequence1 sequence2 sequence3 -ap apdata -sm smdata -le ledata -gt gtdata -coc (-tree 'fasttree') (-ts '-fastest') (-n 4) (--to_stop) (--cds) (-o wgd_dmd) (-t working_tmp)

A novel phylogenetic inference method named "collinear coalescence inference" is also implemented in wgd v2. For this analysis, users need to provide the anchor points file via -ap or --anchorpoints, the collinear segments file via -sm or --segments, the listsegments file via -le or --listelements, and the gene table file via -gt or --genetable, all of which can be produced by the program wgd syn. The parameter -coc or --collinearcoalescence needs to be set to start this analysis. The tree-inference program and the associated parameters can be set as above via -tree or --tree_method and -ts or --treeset. Please make sure that the chosen tree-inference program is installed in the environment path. The program astral-pro also needs to be installed in the environment path. Note that there should be no duplicated gene IDs in the sequence files. The parallel threads here accelerate the sequence alignment and gene tree inference for each gene family, so the number of threads is suggested to be set up to the number of gene families.

A suggested starting run of this analysis is with the simple command below.

wgd dmd sequence1 sequence2 sequence3 -ap apdata -sm smdata -le ledata -gt gtdata -coc

wgd ksd

The construction of whole paranome KS age distribution

wgd ksd families sequence (-o wgd_ksd -t wgd_ksd_tmp --nthreads 4 --to_stop --cds --pairwise --strip_gaps --aligner mafft --tree_method fasttree --node_average)

The program wgd ksd, as implied by its name, constructs KS age distributions. Besides the aforementioned parameters such as --to_stop and --cds, several parameters have a crucial impact on the KS estimation. The option --pairwise is a very important one. With it, CODEML calculates the KS for each gene pair separately, based on the alignment of only these two genes instead of the whole alignment of the family, so that fewer gaps are expected and the alignment considered by CODEML is longer (because CODEML automatically skips every column containing a gap, regardless of whether the gap is present in all sequences or only in some). Without it, CODEML calculates the KS based on the whole alignment of the family, which may yield no KS result at all if the stripped alignment length (after removing all gap-containing columns) is zero; this is one cause of the different numbers of KS estimates between the "pairwise mode" and the "non-pairwise mode". It is difficult to say which mode is more ideal, although the "non-pairwise mode" (the default setting), which runs on the whole alignment instead of a local alignment, might be more biologically conserved in that it ensures that the evolution of each column starts from the root of the family and that all gene duplicates are taken into account in the KS estimation. The option --strip_gaps removes all gap-containing columns; it does not affect the result of the "non-pairwise mode" but does alter the result of the "pairwise mode". The options --aligner and --aln_options, which decide which alignment program is used and with which parameters, affect the KS results; the default program is mafft with the parameter --auto. 
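
Why the pairwise mode usually leaves more usable columns can be seen by counting gap-free columns. The sketch below is a simplification with hypothetical sequences and a hypothetical function name gap_free_columns: it takes the two sequences out of the family alignment instead of realigning them, whereas the text above describes a separate alignment of each pair.

```python
def gap_free_columns(aligned_seqs):
    """Count alignment columns in which no sequence has a gap character."""
    return sum(1 for column in zip(*aligned_seqs) if "-" not in column)


# Hypothetical aligned family of three genes (equal lengths).
family = {
    "g1": "ATGGCT---AAA",
    "g2": "ATG---GCTAAA",
    "g3": "ATGGCTGCTAAA",
}

whole_family = gap_free_columns(family.values())
pair_g1_g3 = gap_free_columns([family["g1"], family["g3"]])
print(whole_family, pair_g1_g3)
```

In the whole-family alignment the gaps of g1 and g2 both remove columns, while for the pair g1 and g3 only the gaps of g1 matter, so more columns remain for the KS estimate.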
The options --tree_method and --tree_options, which decide which gene tree inference program is used and with which parameters, do not affect the KS estimation itself but do affect the result of de-redundancy. Note that we implemented a built-in gene tree inference method based on Average Linkage Clustering (ALC) (thus a distance-based tree), selected by setting --tree_method to "cluster". Two methods of de-redundancy are implemented in wgd v2, namely the node-weighted and node-averaged methods. The node-weighted method achieves de-redundancy by weighting the KS value associated with each gene pair such that the weights of a single duplication event sum to 1 (note that the number of KS estimates remains the same), while the node-averaged method achieves de-redundancy by calculating one averaged KS value per gene tree node to represent the age of each duplication event. The option --node_average can be set to choose the node-averaged way of de-redundancy. The choice of de-redundancy method affects the detection of WGD signals, which has been investigated in this literature. The parallel threads here parallelize the analysis across gene families, so the number of threads is suggested to be set up to the number of gene families.
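
The two de-redundancy methods described above can be contrasted with a small sketch. It assumes that the KS estimates have already been grouped by the gene tree node (duplication event) they belong to; the function names, node labels and KS values are hypothetical, and this is not the code used by wgd.

```python
from statistics import mean


def node_weighted(node_ks):
    """Keep every Ks estimate and weight it so that each node sums to 1."""
    return [
        (ks, 1 / len(values))
        for values in node_ks.values()
        for ks in values
    ]


def node_averaged(node_ks):
    """Replace the Ks estimates of each node by one averaged value."""
    return {node: mean(values) for node, values in node_ks.items()}


# Hypothetical Ks estimates of the gene pairs under two duplication nodes.
node_ks = {"n1": [0.5, 0.75, 1.0], "n2": [2.0]}
print(node_weighted(node_ks))  # four weighted estimates
print(node_averaged(node_ks))  # one value per node
```

The node-weighted output still contains all four estimates, with the three pairs of n1 sharing a total weight of 1, while the node-averaged output has exactly one value per duplication event.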

A suggested starting run can use the command below

wgd ksd families sequence

The construction of orthologous KS age distribution

wgd ksd families sequence1 sequence2 sequence3 (--reweight)

To move from a paralogous to an orthologous KS age distribution, users only need to provide more cds files. Note that with orthologous gene families the weights can be calculated per species pair instead of over the whole family: when plotting the orthologous KS age distribution between two species, the weight calculated from that specific species pair stays the same, whereas the weight calculated from the whole family varies with the number of species. To initiate the weighting per species pair, set the option --reweight.

A suggested starting run can use the command below

wgd ksd families sequence1 sequence2 sequence3

The construction of KS age distribution with rate correction

wgd ksd families sequence1 sequence2 sequence3 --focus2all sequence1 -sp spdata --extraparanomeks paranomeKsdata (--plotelmm --plotapgmm --anchorpoints apdata --reweight --onlyrootout)

Inspired by the rate correction algorithm in ksrates, we also implemented rate correction analysis in wgd v2. It is mostly the same as in ksrates but differs in the calculation of the standard deviation of rescaled KS ages. Rate correction can be performed with both the wgd ksd and wgd viz programs. For the wgd ksd program, users need to provide, via the option --speciestree, a species tree on which rate correction can be conducted. Note that unnecessary brackets might lead to unexpected errors; for instance, a tree (A,(B,C)); should not be represented as (A,((B,C)));. The set of species pairs to be shown is flexible: the most convenient option is --focus2all, which simply shows all possible focal-sister species pairs, or users can set the species pairs manually via the option --spair. Note that if the species pairs are set manually, the option --classic must be set as well. The option --onlyrootout can be set to consider only the outgroup at the root, instead of all possible outgroups per focal-sister species pair, which affects the final corrected KS ages. We suggest using all possible outgroups per focal-sister species pair for a less biased result. The paranome KS data should be provided via the option --extraparanomeks.
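Since unnecessary brackets in the species tree can cause errors, a quick check before running may help. The sketch below is a hypothetical helper, not part of wgd v2; it assumes unquoted labels that contain no commas or brackets, and it flags any bracket pair that encloses a single child.

```python
def has_redundant_brackets(newick: str) -> bool:
    """Return True if some bracket pair encloses a single child, as in ((B,C))."""
    commas = []  # commas seen so far at each currently open bracket level
    for char in newick:
        if char == "(":
            commas.append(0)
        elif char == "," and commas:
            commas[-1] += 1
        elif char == ")":
            if not commas:
                raise ValueError("unbalanced brackets in Newick string")
            # A bracket pair without a comma at its own level wraps one child only.
            if commas.pop() == 0:
                return True
    if commas:
        raise ValueError("unbalanced brackets in Newick string")
    return False


print(has_redundant_brackets("(A,(B,C));"))
print(has_redundant_brackets("(A,((B,C)));"))
```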

Some other options have no impact on the rate correction but add layers to, or change the appearance of, the mixed KS age distribution, including --plotapgmm and --plotelmm. The option --plotapgmm calls the GMM analysis on the anchor pair KS age distribution and plots the clustering result on the mixed KS age distribution; it has to be co-set with the option --anchorpoints, which provides the anchor pair information. The option --plotelmm calls the ELMM analysis on the whole paranome KS age distribution. Note that the species names in the species tree file should match the names of the corresponding sequence files. For instance, given the cds file names 'A.cds','B.cds','C.cds', the species tree could be '(A.cds,(B.cds,C.cds));' rather than '(A,(B,C));'. There is no requirement on the name of the paranome KS datafile, which can be named however users prefer. 'GMM' is the abbreviation of Gaussian Mixture Modeling, while 'ELMM' refers to Exponential-Lognormal Mixture Modeling as used in ksrates.
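The naming requirement can be checked before a run with a small sketch like the one below. The helpers leaf_names and unmatched_names are hypothetical, the regular expression only handles unquoted labels, and comparing against the file base name is an assumption about how names are matched.

```python
import os
import re


def leaf_names(newick: str) -> list:
    """Extract leaf labels from a Newick string with unquoted labels."""
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


def unmatched_names(newick: str, sequence_files: list) -> set:
    """Return tree leaves and sequence file names that lack a counterpart."""
    leaves = set(leaf_names(newick))
    files = {os.path.basename(path) for path in sequence_files}
    return leaves ^ files


cds_files = ["A.cds", "B.cds", "C.cds"]
print(unmatched_names("(A.cds,(B.cds,C.cds));", cds_files))  # nothing unmatched
print(unmatched_names("(A,(B,C));", cds_files))  # every name is unmatched
```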

There are 21 columns in the result .ks.tsv file besides the index column pair, which is the unique identifier of each gene pair. The columns N, S, dN, dN/dS, dS, l and t come from the codeml results, representing the N estimate, the S estimate, the dN estimate, the dN/dS (omega) estimate, the dS estimate, the log-likelihood and the t estimate, respectively. The columns alignmentcoverage, alignmentidentity and alignmentlength describe the alignment of each family: the ratio of the stripped alignment length to the full alignment length, the ratio of columns with identical nucleotides to all columns of the stripped alignment, and the length of the full alignment, respectively.
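As a small illustration of how such a table might be consumed downstream, the sketch below reads a toy tab-separated table and keeps the dS values within a KS range. The column names pair, dS, alignmentcoverage and alignmentlength come from the description above; the rows are made-up toy data, the exact file layout is an assumption, and load_ks is a hypothetical helper.

```python
import csv
import io

# Toy stand-in for a .ks.tsv file (made-up values, only four of the columns).
toy_ks_tsv = (
    "pair\tdS\talignmentcoverage\talignmentlength\n"
    "g1__g2\t0.85\t0.92\t1200\n"
    "g1__g3\t7.40\t0.40\t300\n"
    "g2__g3\t1.30\t0.75\t900\n"
)


def load_ks(handle, ks_range=(0.0, 5.0)):
    """Return {pair: dS} for rows whose dS lies in (low, high]."""
    low, high = ks_range
    result = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        ks = float(row["dS"])
        if low < ks <= high:
            result[row["pair"]] = ks
    return result


ks_values = load_ks(io.StringIO(toy_ks_tsv))
print(ks_values)
```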

wgd syn

The intra-specific synteny inference

wgd syn families gff (--ks_distribution ksdata -f gene -a ID --minlen -1 --minseglen 10000 --mingenenum 30)

The program wgd syn mainly deals with collinearity or synteny (both referred to as synteny hereafter) analysis. Two input files are essential: the gene family file and the gff3 file. The gene family file is in the same format as that of OrthoFinder. The software i-adhore is a prerequisite. With default parameters, the program 1) filters gene families based on maximum family size, 2) retrieves gene position and scaffold information from the gff3 file, 3) produces the configuration file and associated datafiles for i-adhore, 4) calls i-adhore with the given parameters to infer synteny, 5) visualizes the synteny in a "dotplot" in units of genes and bases, in a "Syndepth" plot showing the distribution of different categories of collinearity ratios within and between species, and in a "dupStack" plot showing multiplicons with different multiplication levels, and 6) if KS data are given, produces a "KS dotplot" with dots annotated with KS values and a KS distribution with anchor pairs denoted. The gene information in the gene family file and the gff3 file must match, which requires users to set proper --feature and --attribute. The maximum family size to be included can be set via the option --maxsize. Note that this filtering is not mandatory and mainly serves to drop huge families of tandem duplicates and transposable elements (TEs). Users can filter out fragmentary scaffolds via the option --minlen. The minimum length and number of genes for a segment to be considered can be set via the options --minseglen and --mingenenum. Redundant multiplicons can be kept by setting the flag option --keepredun.
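The first step, filtering by maximum family size, can be pictured with the following sketch. It assumes an OrthoFinder-style table (a header row, the family ID in the first column, and one column per species holding comma-separated genes); filter_families and the toy rows are hypothetical and not the wgd v2 code.

```python
def filter_families(lines, maxsize):
    """Keep families with at most `maxsize` genes from OrthoFinder-style TSV lines."""
    kept = {}
    for line in lines[1:]:  # skip the header row
        family_id, *species_columns = line.rstrip("\n").split("\t")
        genes = [
            gene.strip()
            for column in species_columns
            for gene in column.split(",")
            if gene.strip()
        ]
        if len(genes) <= maxsize:
            kept[family_id] = genes
    return kept


toy_table = [
    "Orthogroup\tspeciesA\n",
    "OG0\tg1, g2, g3, g4\n",
    "OG1\tg5, g6\n",
]
print(filter_families(toy_table, maxsize=3))
```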

A suggested starting run can simply use the command below

wgd syn families gff

The inter-specific synteny inference

wgd syn families gff1 gff2 (--additionalgffinfo "mRNA;Name" --additionalgffinfo "gene;ID")

For multi-species synteny inference, if the gff3 files use different features or attributes for gene position information retrieval, the option --additionalgffinfo can be set to provide the additional information. The remaining parameter settings are the same as for the intra-specific synteny inference.

A suggested starting run can simply use the command below

wgd syn families gff1 gff2

wgd viz

The visualization of KS age distribution and ELMM analysis

wgd viz -d ksdata

The program wgd viz mainly serves to visualize KS distributions and synteny, with some optional mixture modeling analysis. Its basic function is to plot the KS distribution and conduct an ELMM analysis in search of potential WGD components. Key parameters affecting the ELMM result include --prominence_cutoff and --rel_height, which have been explained above, and --em_iterations and --em_initializations, which determine the maximum numbers of iterations and initializations in the EM algorithm.

A suggested starting run can simply use the command below

wgd viz -d ksdata

The visualization of KS age distribution with rate correction

wgd viz -d ksdata -sp spdata --focus2all focal_species --extraparanomeks ksdata (--anchorpoints apdata --plotapgmm --plotelmm)

Besides the basic KS plot, substitution rate correction can also be performed given at least a species tree (via the option --speciestree) and a focal species (either via the option --focus2all or via the option --spair in the form of "$focal_species;$focal_species"). We suggest providing the orthologous KS data via the --datafile option and the paralogous KS data via the --extraparanomeks option, although it is also possible to provide a single KS datafile containing both orthologous and paralogous KS data via the --datafile option. There are two types of mixed plots ("mixed" here refers to mixed orthologous and paralogous KS distributions). One is similar to what ksrates plots, while the other is like the conventional KS plot, in which both the original orthologous and paralogous KS distributions are plotted in full instead of being represented by vertical lines. We suggest adopting the ksrates-like plots, which are the default; otherwise the option --classic needs to be set. Extra mixture modeling analysis can be initiated via the options --plotelmm and --plotapgmm, with the anchor points datafile provided via the --anchorpoints option.

A suggested starting run can simply use the command below

wgd viz -d ksdata -sp spdata --focus2all focal_species --extraparanomeks ksdata

The visualization of synteny

wgd viz -ap apdata -sm smdata -mt mtdata -gt gtdata --plotsyn (--minlen -1 --minseglen 10000 --mingenenum 30)

Compared to the original wgd, the wgd viz program adds a synteny visualization pipeline. Users need to set the flag option --plotsyn to initiate this part of the pipeline. This step assumes that users have already obtained the syntenic results from i-adhore and uses those result files for the synteny visualization. Similar to wgd syn, an extra KS data file can be passed via the option --datafile (instead of the --extraparanomeks option). The syntenic result files required are the anchor points datafile, the multiplicons datafile, the gene table datafile (automatically produced by wgd syn), and the segments datafile.

A suggested starting run can simply use the command below

wgd viz -ap apdata -sm smdata -mt mtdata -gt gtdata --plotsyn

wgd mix

The mixture model clustering analysis of KS age distribution

wgd mix ksdata (--n_init 200 --max_iter 200 --ks_range 0 5 --filters 300 --bins 50 --components 1 4 --gamma 0.001)

This Gaussian mixture modeling (GMM) analysis is inherited from the original wgd program, but additionally writes the probability of each KS value into the final dataframe. Users need to provide a KS datafile (normally from the whole paranome or from anchor pairs), on which the GMM analysis will be conducted. Parameters that can affect the results include --n_init, which sets the number of k-means initializations (default 200); --max_iter, which sets the maximum number of iterations (default 200); --method, which determines which clustering method to use (default gmm); --gamma, which sets the gamma parameter for the bgmm model (default 0.001); --components, which sets the range of the number of components to fit (default 1 4); the data filtering parameters --filters, which filters data based on alignment length, and --ks_range, which filters data based on KS values; and --bins, which sets the number of bins in the KS distribution (default 50).

A suggested starting run can simply use the command below

wgd mix ksdata

wgd peak

The search for a credible KS range used in WGD dating

wgd peak ksdata -ap apdata -sm smdata -le ledata -mp mpdata --heuristic (--alignfilter 0.0 0 0.0 --ksrange 0 5 --bin_width 0.1 --guide segment --prominence_cutoff 0.1 --rel_height 0.4 --ci 95 --hdr 95 --kscutoff 5)

As mentioned previously, a heuristic method and a collinear segments-guided anchor pair clustering for the search of a credible KS range used in WGD dating are implemented in wgd v2. Users need to provide the anchor points, segments, list elements and multiplicon pairs datafiles from i-adhore to use the clustering function. Parameters that can affect the results include --alignfilter, which filters the data based on alignment identity, length and coverage; --ksrange, which sets the range of KS to be analyzed; --bin_width, which sets the bandwidth of the KS distribution; --weights_outliers_included, which determines whether to include KS outliers (whose value is over 5) in the analysis; --method, which determines which clustering method to use (default gmm); --seed, which sets the random seed given to initialization (default 2352890); --n_init, which sets the number of k-means initializations (default 200); --em_iter, which sets the maximum number of iterations (default 200); --gamma, which sets the gamma parameter for the bgmm model (default 0.001); --components, which sets the range of the number of components to fit (default 1 4); --weighted, which determines whether to use the node-weighted method for de-redundancy; --guide, which determines which regime of residing anchors is used (default Segment); --prominence_cutoff, which sets the prominence cutoff of acceptable peaks in the peak finding process; --rel_height, which sets the relative height at which the peak width is measured; --kstodate, which manually sets the range of KS to be dated in the heuristic search and needs to be co-set with the option --manualset; --xlim and --ylim, which determine the x and y axis limits of the GMM KS distribution; --ci, which sets the confidence level of the log-normal distribution to date (default 95); --hdr, which sets the highest density region (HDR) applied in the segment-guided anchor pair KS distribution (default 95); --heuristic, which determines whether to initiate the heuristic method of defining the CI for dating; and --kscutoff, which sets the KS saturation cutoff in dating (default 5).

A suggested starting run can simply use the command below

wgd peak ksdata -ap apdata -sm smdata -le ledata -mp mpdata --heuristic
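To give a feeling for what a confidence level on a lognormal distribution means here (the --ci option described above), the sketch below computes the central interval of a lognormal from its log-scale mean and standard deviation. The function lognormal_interval and the parameter values are hypothetical; how wgd peak actually fits the distribution is not shown.

```python
from math import exp
from statistics import NormalDist


def lognormal_interval(mu: float, sigma: float, ci: float = 95.0):
    """Central `ci`% interval of a lognormal with log-scale mean mu and sd sigma."""
    tail = (1.0 - ci / 100.0) / 2.0
    z = NormalDist().inv_cdf(1.0 - tail)  # about 1.96 for ci = 95
    return exp(mu - z * sigma), exp(mu + z * sigma)


# Hypothetical fitted parameters on the log(KS) scale.
low, high = lognormal_interval(mu=0.3, sigma=0.35)
print(low, high)
```

Anchor pairs whose KS falls outside such an interval would then be the ones excluded from dating.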

wgd focus

The concatenation-based/coalescence-based phylogenetic inference

wgd focus families sequence1 sequence2 sequence3 (--concatenation) (--coalescence) (-tree 'fasttree') (-ts '-fastest') (-n 4) (--to_stop) (--cds) (-o wgd_focus) (-t working_tmp)

The program wgd focus implements two basic phylogenetic inference methods, i.e., the concatenation-based and coalescence-based methods. To initiate these analyses, users need to set the flag option --concatenation or --coalescence. The concatenation-based method consists of a few major steps: the multiple sequence alignment (MSA) of each gene family, the concatenation of all gene families, and then the gene tree (also the species tree in this case) inference. The coalescence-based method instead performs the MSA of each gene family and the gene tree inference based on each MSA, and then infers the species tree from these individual gene trees. The tree-inference program and the associated parameters can be set as above via -tree or --tree_method and -ts or --treeset. Please also make sure that the chosen tree-inference program is installed in the environment path. The program astral-pro must be installed in the environment path if the coalescence-based method is chosen. The parallel threads here accelerate the sequence alignment and gene tree inference across gene families as well, so it is suggested to use as many threads as there are gene families.

A suggested starting run of this analysis is with the simple command below.

wgd focus families sequence1 sequence2 sequence3 --concatenation
wgd focus families sequence1 sequence2 sequence3 --coalescence

The functional annotation of gene families

wgd focus families sequence1 sequence2 sequence3 --annotation eggnog -ed eddata --dmnb dbdata

The program wgd focus also adds wrapper functions for the functional annotation of gene families, built on databases and software such as EggNOG and EggNOG-mapper. For the annotation using EggNOG-mapper, users need to provide the path to the eggNOG annotation database via the option -ed and the path to the diamond-compatible database via the option --dmnd_db, and set the option --annotation to "eggnog". How the PFAM annotation is performed can be controlled via the option pfam, either "none", "realign" or "denovo"; a detailed explanation can be found in the wiki of EggNOG-mapper. Please pre-install the EggNOG-mapper python package if using this function. For the annotation using hmmscan, wgd v2 implements a simple wrapper function that performs the hmmscan analysis for a given hmm profile and set of gene families; users need to set the option --annotation to "hmmpfam" and provide the hmm profile via the option --hmm. For the annotation using interproscan, users need to provide, via the option --exepath, the path to the interproscan installation folder containing the interproscan.sh file, and set the option --annotation to "interproscan". The parallel threads here parallelize the analysis across gene families, so it is suggested to use as many threads as there are gene families.

A suggested starting run of this analysis can be with the command below.

wgd focus families sequence1 sequence2 sequence3 --annotation eggnog -ed eddata --dmnb dbdata
wgd focus families sequence1 sequence2 sequence3 --annotation hmmpfam --hmm hmmdata
wgd focus families sequence1 sequence2 sequence3 --annotation interproscan --exepath $PATH

The phylogenetic dating of WGDs

wgd focus families sequence1 sequence2 sequence3 -d mcmctree -sp spdata (--protcocdating) (--partition) (--aamodel lg) (-ds 'burnin = 2000') (-ds 'sampfreq = 1000') (-ds 'nsample = 20000')  (-n 4) (--to_stop) (--cds) (-o wgd_focus) (-t working_tmp)

The absolute dating of WGDs is a specific pipeline implemented in wgd v2 using phylogenetic dating. The families used in this step can be produced by wgd dmd and wgd peak. Note that here we only discuss how to date WGDs with a genome assembly, not a transcriptome assembly (which will be discussed in a separate section below). Our assumption is that not all anchor pairs (collinear duplicates) are suitable for phylogenetic dating, for instance gene duplicates that evolve very fast or very slowly, because they are prone to give biased dating estimates. To retrieve the "reliable" anchor pairs, we implemented methods for identifying credible anchor pairs based on their KS values and/or the collinear segments in which they reside. For a genome with clear signals of putative WGDs, such as Aquilegia coerulea in the example, a heuristic method that applies the principle by which ksrates finds the initial peaks and their parameters was implemented. It finds the 95% confidence level of the assumed lognormal distribution of the anchor pair KS age distribution and filters out anchor pairs with too high or too low KS ages; the exemplary command is shown in the wgd peak section of Illustration. If this heuristic method fails to give reasonable results, which is usually due to multiple adjacent WGDs blurring the peak-finding process, users can turn to the collinear segments-guided anchor pair clustering implemented in wgd v2. Here a collinear segment-wise GMM clustering is first conducted based on the so-called segment KS age, represented by the median KS age of all the syntelogs residing in the segment (note that the gene duplicate pairs used in this step come from the file multiplicon_pairs.txt, which contains the full set of syntelogs, instead of the smaller gene set of anchor pairs).
The distinction between anchor pairs and syntelogs is that the latter refers to multiple sets of genes derived from the same ancestral genomic region, while the former implies the latter but additionally requires conserved gene order. Within a genome, both are assumed to originate from the duplication of a common ancestral genomic region and are as such deemed evidence for WGD. The syntelogs are then mapped back according to the clustering results of their affiliated segments. Since the Gaussian shape of the segment cluster doesn't necessarily match the shape of the distribution of the residing syntelogs, we adopted the (95%) highest density region (HDR) of the syntelog KS age distribution to retrieve the "reliable" gene sets for phylogenetic dating. The calculation of the (95%) HDR in the function calculateHPD seeks the shortest KS range (a,b) that spans more than (95%) of all the KS values. Given the identified anchor pairs (note that the syntelogs are also referred to as anchor pairs hereafter), we implemented in the program wgd dmd the so-called anchor-aware local MRBHs or orthogroups, in which the original local MRBHs are further merged with anchor pairs such that each orthogroup contains the anchor pair and the orthologues. With this orthogroup, users then need one starting tree file (as shown in the Illustration part) indicating the tree topology and fossil calibration information for the final phylogenetic dating using the program wgd focus. Users can set the flag option --protcocdating to only conduct the dating of the concatenated protein MSA, or the flag option --protdating to only conduct the dating of the protein MSAs. Note that these two options only work for the mcmctree option so far. The flag option --partition can be set to perform the mcmctree analysis using partitioned data (i.e., the 1st, 2nd and 3rd codon positions) instead of using the codon as a whole.
The option --aamodel determines the amino acid model applied in the mcmctree analysis. The option -ds can be used to set parameters for the molecular dating program. The parallel threads here parallelize the analysis across gene families, so it is suggested to use as many threads as there are gene families.
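The idea of the highest density region described above, the shortest KS range covering a given share of the values, can be sketched as follows. This is a generic sliding-window version written for illustration; it is not the calculateHPD function of wgd v2, it requires "at least" rather than "more than" the requested share, and the KS values are made up.

```python
from math import ceil


def shortest_interval(values, mass=0.95):
    """Shortest range (a, b) that contains at least `mass` of the values."""
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("no values given")
    window = max(1, ceil(mass * n))  # number of points the range must cover
    # Slide a window of that many sorted points and keep the narrowest one.
    best_start = min(
        range(n - window + 1),
        key=lambda i: data[i + window - 1] - data[i],
    )
    return data[best_start], data[best_start + window - 1]


# Made-up syntelog KS values with one low and one high outlier.
toy_ks = [0.1, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 4.0]
print(shortest_interval(toy_ks, mass=0.8))
```

With these toy values the narrowest range covering 80% of the points excludes both outliers.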

A suggested starting run can be with the command below.

wgd focus families sequence1 sequence2 sequence3 -d mcmctree -sp spdata
wgd focus families sequence1 sequence2 sequence3 -d r8s -sp spdata --nsites nsiteinfo
wgd focus families sequence1 sequence2 sequence3 -d beast --fossil fossilinfo --rootheight rootheightinfo --chainset chainsetinfo --beastlgjar $PATH

Illustration

We illustrate our program with an exemplary WGD inference and dating for the species Aquilegia coerulea.

Aquilegia coerulea was reported to have experienced a paleo-polyploidization event after the divergence of core eudicots, which is likely shared by all Ranunculales.

First of all, let's delineate the whole paranome KS age distribution to get a first look at potential WGDs, using the command lines below.

wgd dmd Aquilegia_coerulea
wgd ksd wgd_dmd/Aquilegia_coerulea.tsv Aquilegia_coerulea

The constructed whole paranome KS age distribution of Aquilegia coerulea is shown below. There seems to be a hump at KS 1, but it is not clear.

We then construct the anchor KS age distribution using the command line below.

wgd syn -f mRNA -a Name wgd_dmd/Aquilegia_coerulea.tsv Aquilegia_coerulea.gff3 -ks wgd_ksd/Aquilegia_coerulea.tsv.ks.tsv

As shown below, there are some retained anchor pairs with KS between 1 and 2, which seems to suggest a WGD event.

The associated dupStack plot shows that there are numerous duplicated segments across most of the chromosomes.

We implemented two types of dot plots in oxford grid: one in the unit of bases and the other in the unit of genes, which can be colored by KS values given KS data.

As shown above, the dot plot in the unit of genes presents numerous densely aggregated (line-like) anchor points at most of the chromosomes with consistent KS age between 1 and 2. The dot plot in the unit of bases shows the same pattern, as manifested below.

The dot plots without KS annotation will also be produced automatically, as shown below.

Note that the opacity of anchor dots and all homolog dots can be set by the option --apalpha and --hoalpha separately. If one just wants to see the anchor dots, setting the hoalpha as 0 (or other minuscule values) will do. If one wants to see the distribution of whole dots better, setting the hoalpha higher (and apalpha lower) will do. The dotsize option can be called to adjust the size of dots.

A further associated Syndepth plot shows that there are more than 50 duplicated segments longer than 10000 bp and 30 genes (so as to drop fragmentary segments), which dominates the whole collinear ratio category.

We can fit an ELMM mixture model upon the whole paranome KS age distribution to see more accurately the significance and location of potential WGDs, using the command line below.

wgd viz -d wgd_ksd/Aquilegia_coerulea.tsv.ks.tsv

The result of ELMM mixture model clustering shows that there is a likely WGD component at KS 1.19.

Let's do a mixture model clustering for anchor KS too, using the command line below. Note that this step will automatically call the segment KS clustering analysis too.

wgd peak wgd_ksd/Aquilegia_coerulea.tsv.ks.tsv --anchorpoints wgd_syn/iadhore-out/anchorpoints.txt --segments wgd_syn/iadhore-out/segments.txt --listelements wgd_syn/iadhore-out/list_elements.txt --multipliconpairs wgd_syn/iadhore-out/multiplicon_pairs.txt (--weighted)

The anchor KS age distribution also has a likely WGD component with mode 1.28.

We have now seen the evidence of numerous duplicated segments and the aggregation of duplicate ages at KS 1.28 or 1.19 for anchor pairs and non-anchor pairs throughout the whole genome, so we can claim with some confidence that Aquilegia coerulea might have experienced a paleo-polyploidization event. Next, let's have a further look at its phylogenetic location. There are uncertainties about whether this putative paleo-polyploidization event is shared with all eudicots or not. We can choose some other eudicot genomes to see the ordering of speciation and polyploidization events. Here we choose Vitis vinifera, Protea cynaroides and Acorus americanus for the following KS analysis. First, we build a global MRBH family using the command below.

wgd dmd --globalmrbh Aquilegia_coerulea Protea_cynaroides Acorus_americanus Vitis_vinifera -o wgd_globalmrbh

In the global MRBH family, every pair of orthologous genes is the reciprocal best hit, suggesting true orthologous relationships. We would use the KS values associated with these orthologous pairs to delimit the divergence KS peak. Together with the whole paranome KS distribution, we conduct the rate correction using the command below.

Note: since wgd version 2.0.24, substitution rate correction has been rewritten in a cleaner and quicker way. It is no longer required to type in any species pair, and a series of KS plots will be produced. The required inputs are an orthologous KS datafile, a paralogous KS datafile, a species tree and a focal species (the one for which the paralogous KS data are provided). Users can choose to add one more layer of ELMM analysis on the paralogous KS values and/or GMM analysis on the anchor KS distribution. The orthologous KS distribution can be calculated using the command below.

wgd ksd wgd_globalmrbh/global_MRBH.tsv Aquilegia_coerulea Protea_cynaroides Acorus_americanus Vitis_vinifera -o wgd_globalmrbh_ks

With the calculated orthologous KS distribution, we can use the command below to conduct the rate correction and/or mixture modeling analysis.

wgd viz -d wgd_globalmrbh_ks/global_MRBH.tsv.ks.tsv -fa Aquilegia_coerulea -epk wgd_ksd/Aquilegia_coerulea.tsv.ks.tsv -ap wgd_syn/iadhore-out/anchorpoints.txt -sp speciestree.nw -o wgd_viz_mixed_Ks --plotelmm --plotapgmm --reweight

Alternatively, the command below combines the two steps above into one. Note that we suggest taking two separate steps, in which wgd ksd calculates the orthologous KS distribution while wgd viz carries out the rate correction and GMM analysis, because this is easier to debug.

wgd ksd wgd_globalmrbh/global_MRBH.tsv Aquilegia_coerulea Protea_cynaroides Acorus_americanus Vitis_vinifera --extraparanomeks wgd_ksd/Aquilegia_coerulea.tsv.ks.tsv -sp speciestree.nw --reweight -o wgd_globalmrbh_ks_rate_correction -fa Aquilegia_coerulea -ap wgd_syn/iadhore-out/anchorpoints.txt --plotelmm --plotapgmm

The file speciestree.nw is a text file containing the species tree in newick format on which rate correction is conducted. Its content is as below. Users can optionally provide the species pairs to be plotted, but we suggest just using -fa Aquilegia_coerulea to plot all possible focal-sister species pairs. We suggest adding the option --reweight to recalculate the weight per species pair such that the weight of orthologous gene pairs becomes 1. Extra collinear data can be added via the option -ap, and additional clustering analysis can be initiated by setting the options --plotapgmm and --plotelmm.

(((Vitis_vinifera,Protea_cynaroides),Aquilegia_coerulea),Acorus_americanus);

The mixed KS distribution shown above is a publication-ready figure that assembles the results of ELMM, GMM and rate correction. The one-vs-one orthologous KS distributions are also automatically produced, with rate correction results superimposed where available, as shown below.

Besides the above ksrates-like KS plot, a more classic KS plot can be made by adding the option --classic to examine the variation of the synonymous substitution rate in more detail.

Using the command below, the direction of rate correction and degree of rate variation can be observed more directly.

wgd viz -d wgd_globalmrbh_ks/global_MRBH.tsv.ks.tsv -fa Aquilegia_coerulea -epk wgd_ksd/Aquilegia_coerulea.tsv.ks.tsv -ap wgd_syn/iadhore-out/anchorpoints.txt -sp speciestree.nw -o wgd_viz_mixed_Ks --plotelmm --plotapgmm --reweight --plotkde --classic

As shown above, because of the higher substitution rate of Aquilegia coerulea, the original orthologous KS values were actually underestimated in the time-frame of Aquilegia coerulea. When we recovered the divergence substitution distance as two times the branch-specific contribution of A. coerulea since its divergence from the sister species, plus the shared substitution distance before divergence (relative to the outgroup), the corrected KS mode became larger.
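The "two times the branch-specific contribution" part of this reasoning can be written out as a small sketch. It follows the relative-rate formulation used by ksrates as far as we understand it, which is an assumption here; the function names and the KS modes are hypothetical, and wgd v2 may differ in details such as how several outgroups are combined.

```python
def branch_contributions(k_focal_sister, k_focal_out, k_sister_out):
    """Split the focal-sister KS distance into the two branch-specific parts."""
    k_focal = (k_focal_sister + k_focal_out - k_sister_out) / 2.0
    k_sister = (k_focal_sister + k_sister_out - k_focal_out) / 2.0
    return k_focal, k_sister


def corrected_divergence_ks(k_focal_sister, k_focal_out, k_sister_out):
    """Divergence KS rescaled to the focal species' rate: twice its own branch."""
    k_focal, _ = branch_contributions(k_focal_sister, k_focal_out, k_sister_out)
    return 2.0 * k_focal


# Hypothetical KS modes: focal-sister, focal-outgroup, sister-outgroup.
k_focal, k_sister = branch_contributions(1.0, 2.0, 1.6)
print(k_focal, k_sister)
print(corrected_divergence_ks(1.0, 2.0, 1.6))
```

In this toy case the focal branch is longer than the sister branch, so the corrected mode is larger than the original focal-sister mode, matching the direction of the shift described above.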

Note that we can easily show that Aquilegia coerulea has higher substitution rate than Protea cynaroides and Vitis vinifera by comparing their substitution distance in regard to the same divergence event with outgroup species Acorus_americanus, using command below.

wgd viz -d wgd_globalmrbh_ks/global_MRBH.tsv.ks.tsv -sp speciestree.nw --reweight -o wgd_viz_Compare_rate --spair "Acorus_americanus;Protea_cynaroides" --spair "Aquilegia_coerulea;Acorus_americanus" --spair "Vitis_vinifera;Acorus_americanus" --plotkde --classic

As displayed above, the orthologous KS values between Aquilegia coerulea and Acorus americanus have the highest mode, indicating the faster substitution rate of A. coerulea compared to Protea cynaroides and Vitis vinifera.

Before v2.0.21, the gene-species map file was required for wgd viz; it should be automatically produced by the preceding wgd ksd step given the spair and speciestree parameters. The gene_species.map file has contents as below, in which each line joins a gene name and a species name with a space. From v2.0.21 onwards, the gene-species map file is no longer required.

Aqcoe6G057800.1 Aquilegia_coerulea
Vvi_VIT_201s0011g01530.1 Vitis_vinifera
Pcy_Procy01g08510 Protea_cynaroides
Aam_Acora.04G142900.1 Acorus_americanus
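For readers who want to inspect such a map file, the sketch below parses lines of this form into a dictionary. parse_gene_species_map is a hypothetical helper, and the two example lines are copied from the content shown above.

```python
def parse_gene_species_map(lines):
    """Map each gene name to its species from 'gene species' lines."""
    mapping = {}
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        gene, species = line.split()
        mapping[gene] = species
    return mapping


example_lines = [
    "Aqcoe6G057800.1 Aquilegia_coerulea",
    "Vvi_VIT_201s0011g01530.1 Vitis_vinifera",
]
gene_to_species = parse_gene_species_map(example_lines)
print(gene_to_species)
```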

An alternative way to calculate the orthologous KS is to directly use the orthogroups instead of global MRBH family. That way we don't use the strictly 1-vs-1 orthologues as global MRBH but all the orthologous gene pairs inside each orthogroup instead. To achieve that, we first need to infer orthogroups using the command below.

wgd dmd Aquilegia_coerulea Protea_cynaroides Acorus_americanus Vitis_vinifera --orthoinfer -o wgd_ortho (--onlyortho) 

Users can choose to conduct only the orthogroup analysis and skip the other analyses by adding the flag --onlyortho. The next step is the same as with the global MRBH family.

wgd ksd wgd_ortho/Orthogroups.sp.tsv Aquilegia_coerulea Protea_cynaroides Acorus_americanus Vitis_vinifera -o wgd_ortho_ks
wgd viz -d wgd_ortho_ks/Orthogroups.sp.tsv.ks.tsv -epk wgd_ksd/Aquilegia_coerulea.tsv.ks.tsv -sp speciestree.nw --reweight -ap wgd_syn/iadhore-out/anchorpoints.txt --plotelmm --plotapgmm -o wgd_ortho_ks_rate_correction

As shown above, the number of orthologous gene pairs differs from that obtained with the global MRBH families, because here we plotted all orthologous gene pairs instead of only those in global MRBH families, and the recalculated weights differ as well.

After the phylogenetic timing of the Ranunculales WGD, we can further infer its absolute age. First, we heuristically infer the credible KS range of the anchor pairs using the program wgd peak.

wgd peak --heuristic wgd_ksd/Aquilegia_coerulea.tsv.ks.tsv -ap wgd_syn/iadhore-out/anchorpoints.txt -sm wgd_syn/iadhore-out/segments.txt -le wgd_syn/iadhore-out/list_elements.txt -mp wgd_syn/iadhore-out/multiplicon_pairs.txt -o wgd_peak

As shown above, we assumed a lognormal distribution at the peak location detected by the signal module of the SciPy library. The 95% confidence interval of this lognormal distribution, i.e., KS 0.68-2.74, was applied in the subsequent molecular dating. The file Aquilegia_coerulea.tsv.ks.tsv_95%CI_AP_for_dating_weighted_format.tsv is what we need for the next step. To build the orthogroups used in phylogenetic dating, we need to select some species and form a starting tree with proper fossil calibrations. We provide one in mcmctree format below.

17 1
((((Potamogeton_acutifolius,(Spirodela_intermedia,Amorphophallus_konjac)),(Acanthochlamys_bracteata,(Dioscorea_alata,Dioscorea_rotundata))'>0.5600<1.2863')'>0.8360<1.2863',(Acorus_americanus,Acorus_tatarinowii))'>0.8360<1.2863',((((Tetracentron_sinense,Trochodendron_aralioides),(Buxus_austroyunnanensis,Buxus_sinica))'>1.1080<1.2863',(Nelumbo_nucifera,(Telopea_speciosissima,Protea_cynaroides)))'>1.1080<1.2863',(Aquilegia_coerulea_ap1,Aquilegia_coerulea_ap2))'>1.1080<1.2863')'>1.2720<2.4720';
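
Since the first header number in a PAML-style tree file gives the number of species and the second the number of trees, a quick sanity check on a hand-edited starting tree is to count its leaves. The sketch below is an illustrative helper, not part of wgd; the function name leaf_names and the regular expressions are assumptions that rely on the calibrations being written as single-quoted node labels, as in the tree above.

```python
import re

HEADER = "17 1"
TREE = (
    "((((Potamogeton_acutifolius,(Spirodela_intermedia,Amorphophallus_konjac)),"
    "(Acanthochlamys_bracteata,(Dioscorea_alata,Dioscorea_rotundata))'>0.5600<1.2863')'>0.8360<1.2863',"
    "(Acorus_americanus,Acorus_tatarinowii))'>0.8360<1.2863',"
    "((((Tetracentron_sinense,Trochodendron_aralioides),(Buxus_austroyunnanensis,Buxus_sinica))'>1.1080<1.2863',"
    "(Nelumbo_nucifera,(Telopea_speciosissima,Protea_cynaroides)))'>1.1080<1.2863',"
    "(Aquilegia_coerulea_ap1,Aquilegia_coerulea_ap2))'>1.1080<1.2863')'>1.2720<2.4720';"
)


def leaf_names(newick):
    """Return the leaf names of a Newick string that carries quoted calibrations."""
    # Remove the quoted calibration labels such as '>0.5600<1.2863'.
    stripped = re.sub(r"'[^']*'", "", newick)
    # A leaf name directly follows '(' or ',' and ends at ',', ')', ':' or ';'.
    return re.findall(r"[(,]\s*([^(),:;\s]+)", stripped)


n_species, n_trees = (int(x) for x in HEADER.split())
leaves = leaf_names(TREE)
calibrations = re.findall(r"'([^']*)'", TREE)

print("leaves in tree:", len(leaves), "| header says:", n_species)
print("calibrated nodes:", len(calibrations))
if len(leaves) != n_species:
    print("Header and tree disagree; check the species list.")
```

The two anchor-pair tips Aquilegia_coerulea_ap1 and Aquilegia_coerulea_ap2 are counted as separate leaves, which is why the header reads 17 although only 16 species are involved.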

As presented above, the focal species to be dated needs to be replaced with (Aquilegia_coerulea_ap1,Aquilegia_coerulea_ap2). With this starting tree and the pre-downloaded CDS files of all the species, we can build the orthogroups used in the final molecular dating using the command below. Note that here we assume that the other species in the starting tree do not share the WGD to be dated, so that the topology of the starting tree is correct; otherwise we would need to further distinguish ap1 and ap2 for those species as well, and then group all ap1 in one branch and all ap2 in another. In that sense, keeping the focal species as the only one in the starting tree that shares the WGD to be dated is a simplified but correct practice.

wgd dmd -f Aquilegia_coerulea -ap wgd_peak/AnchorKs_FindPeak/Aquilegia_coerulea.tsv.ks.tsv_95%CI_AP_for_dating_weighted_format.tsv -o wgd_dmd_ortho Potamogeton_acutifolius Spirodela_intermedia Amorphophallus_konjac Acanthochlamys_bracteata Dioscorea_alata Dioscorea_rotundata Acorus_americanus Acorus_tatarinowii Tetracentron_sinense Trochodendron_aralioides Buxus_austroyunnanensis Buxus_sinica Nelumbo_nucifera Telopea_speciosissima Protea_cynaroides Aquilegia_coerulea

The result file merge_focus_ap.tsv is what we need for the final step of molecular dating in program wgd focus.

wgd focus --protcocdating --aamodel lg wgd_dmd_ortho/merge_focus_ap.tsv -sp dating_tree.nw -o wgd_dating -d mcmctree -ds 'burnin = 2000' -ds 'sampfreq = 1000' -ds 'nsample = 20000' Potamogeton_acutifolius Spirodela_intermedia Amorphophallus_konjac Acanthochlamys_bracteata Dioscorea_alata Dioscorea_rotundata Acorus_americanus Acorus_tatarinowii Tetracentron_sinense Trochodendron_aralioides Buxus_austroyunnanensis Buxus_sinica Nelumbo_nucifera Telopea_speciosissima Protea_cynaroides Aquilegia_coerulea

Here we only ran the concatenation analysis using protein sequences by adding the flag --protcocdating, and we set the parameters for mcmctree via the option -ds. Note that other dating programs such as r8s and beast are also available, given some mandatory parameters. The final log of a successful run is shown below.

16:04:25 INFO     Running mcmctree using Hessian matrix of LG+Gamma  core.py:967
                  for protein model
23:49:37 INFO     Posterior mean for the ages of wgd is 112.8945 mcmctree.py:296
                  million years from Concatenated peptide
                  alignment and 95% credibility intervals (CI)
                  is 101.224-123.121 million years
         INFO     Total run time: 29175s                              cli.py:241
         INFO     Done                                                cli.py:242

To visualize the dates, we also provide a Python script, postplot.py, in the wgd folder to plot the WGD dates. Users first need to extract the raw dates for the WGD node from mcmc.txt and save them as the file dates.txt (or any preferred name). An example command is given below, in which $PATH stands for the path to the wgd folder.

python $PATH/postplot.py postdis dates.txt --percentile 90 --title "WGD date" --hpd -o "Ranunculales_WGD_date.svg"

Users can freely set the percentile for CI (either in HPD with --hpd or in Equal-Tailed CI without --hpd) and the title and output file name via --title and -o.

The posterior mean, median and mode of the Ranunculales WGD age are 112.92, 113.44 and 112.54 mya, respectively, with a 90% HPD of 105.07 - 122.32 mya, as shown above.
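
The extraction step and the summary statistics mentioned above can be sketched as follows. This is a hedged illustration, not the implementation of postplot.py: it assumes that mcmc.txt is a tab-separated table with a header row and one column per node age, the column name t_n19 is hypothetical and must be replaced by the WGD node of the actual run, and the toy numbers are invented for demonstration only.

```python
import csv
import io
import math
import statistics


def read_node_ages(handle, column):
    """Read one node-age column from a tab-separated table with a header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [float(row[column]) for row in reader]


def hpd_interval(samples, mass=0.90):
    """Return the narrowest interval that contains `mass` of the samples."""
    xs = sorted(samples)
    n = len(xs)
    k = max(1, math.ceil(mass * n))  # number of samples inside the interval
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]


# Toy stand-in for mcmc.txt (hypothetical column names and values).
rows = ["Gen\tt_n18\tt_n19"]
toy_ages = [1.12, 1.0, 1.5, 1.13, 1.1, 1.15, 1.2, 1.14, 1.11, 1.16]
rows += [f"{i}\t2.0\t{age}" for i, age in enumerate(toy_ages)]
toy_mcmc = io.StringIO("\n".join(rows) + "\n")

ages = read_node_ages(toy_mcmc, "t_n19")
interval = hpd_interval(ages, mass=0.90)
print("mean:", statistics.mean(ages), "median:", statistics.median(ages))
print("90% HPD:", interval)

# One value per line, as a plain list of raw dates to save as dates.txt.
dates_txt = "\n".join(str(a) for a in ages)
```

The HPD is taken here as the shortest window over the sorted samples, whereas an equal-tailed interval would cut the same probability mass from both ends; the two differ when the posterior is skewed.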

Kstree

In addition to pairwise KS estimation, a KS tree with branch lengths in KS units can also be derived with wgd ksd given the options --kstree and --speciestree. Note that the additional option --onlyconcatkstree restricts the KS estimation to the concatenated alignment rather than all the alignments. Users need to provide a preset species tree for the KS tree inference of the concatenated alignment, while the remaining alignments are analysed against a tree inferred automatically by fasttree or iqtree. In the end, users get a KS tree, a KA tree and an ω tree per family and for the concatenated alignment.

wgd ksd data/kstree_data/fam.tsv data/kstree_data/Acorus_tatarinowii data/kstree_data/Amborella_trichopoda data/kstree_data/Aquilegia_coerulea data/kstree_data/Aristolochia_fimbriata data/kstree_data/Cycas_panzhihuaensi --kstree --speciestree data/kstree_data/species_tree1.nw --onlyconcatkstree -o wgd_kstree_topology1

Above we used three alternative topologies to infer the KS tree which led to different branch length estimation. Note that the families we used were only two global MRBH families for the purpose of illustration. To acquire an accurate profile of the substitution rate variation, orthologues at the whole genome scale should be used.

In addition, more elaborate collinear plots including both intra-specific and inter-specific comparisons can also be produced with wgd syn using the inferred orthogroups (composed of Aquilegia coerulea, Protea cynaroides, Acorus americanus and Vitis vinifera). Note that different genome assemblies might use different features and attributes, which can be accommodated via the option --additionalgffinfo, given once per genome assembly in the same order as the gff3 files, for instance 'mRNA;Name' for Aquilegia_coerulea.gff3, 'mRNA;ID' for Protea_cynaroides.gff3, 'mRNA;Name' for Acorus_americanus.gff3 and 'mRNA;Name' for Vitis_vinifera.gff3.

wgd dmd -oo -oi Aquilegia_coerulea Protea_cynaroides Acorus_americanus Vitis_vinifera -o wgd_ortho
wgd ksd wgd_ortho/Orthogroups.sp.tsv Aquilegia_coerulea Protea_cynaroides Acorus_americanus Vitis_vinifera -o wgd_ortho_ks
wgd syn wgd_ortho/Orthogroups.sp.tsv -ks wgd_ortho_ks/Orthogroups.sp.tsv.ks.tsv Aquilegia_coerulea.gff3 --additionalgffinfo 'mRNA;Name' Protea_cynaroides.gff3 --additionalgffinfo 'mRNA;ID' Acorus_americanus.gff3 --additionalgffinfo 'mRNA;Name' Vitis_vinifera.gff3 --additionalgffinfo 'mRNA;Name' -o wgd_ortho_syn

Upon the acquisition of the collinear results using wgd syn, the same collinear plots can be also produced by wgd viz using the command below.

wgd viz --plotsyn -sm wgd_ortho_syn/iadhore-out/segments.txt -ap wgd_ortho_syn/iadhore-out/anchorpoints.txt -mt wgd_ortho_syn/iadhore-out/multiplicons.txt -gt wgd_ortho_syn/gene-table.csv -d wgd_ortho_ks/Orthogroups.sp.tsv.ks.tsv -o wgd_ortho_viz

The dupStack plot above shows the distribution of duplicated segments of Aquilegia coerulea compared to itself (in green) and compared to Vitis vinifera (in blue) over the chromosomes of A. coerulea. The KS dotplot above, in units of genes, shows the overall distribution of collinearity across the four species involved. The next dotplot lacks the KS age annotation shown in the previous one. The Syndepth plot above shows the collinear ratio across all species pairs (intra-specific comparisons in green and inter-specific comparisons in blue).

Citation

Please cite us at https://doi.org/10.1007/978-1-0716-2561-3_1 and https://doi.org/10.1093/bioinformatics/btae272.

Hengchi Chen, Arthur Zwaenepoel (2023). Inference of Ancient Polyploidy from Genomic Data. In: Van de Peer, Y. (eds) Polyploidy. Methods in Molecular Biology, vol 2545. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-2561-3_1
Hengchi Chen, Arthur Zwaenepoel, Yves Van de Peer (2024). wgd v2: a suite of tools to uncover and date ancient polyploidy and whole-genome duplication. Bioinformatics, Volume 40, Issue 5, May 2024, btae272, https://doi.org/10.1093/bioinformatics/btae272

For citation of the tools used in wgd, please consult the documentation at https://wgdv2.readthedocs.io/en/latest/citation.html.

wgd's People

Contributors

arzwa, heche-psb

wgd's Issues

wgd dmd - struct.error: 'i' format requires -2147483648 <= number <= 2147483647

Hello!

I am trying to run WGD v2 on some transcriptome data. I have successfully run wgd dmd on each sample independently (e.g., wgd dmd Sample1.fasta -of). However, when I try to do pairwise (e.g., wgd dmd Sample1.fasta Sample2.fasta -of) I get this error with some samples:

Traceback (most recent call last):
  File "/home/ermoore3/miniconda2/envs/mamba/envs/wgdv2/bin/wgd", line 8, in <module>
    sys.exit(cli())
  File "/home/ermoore3/miniconda2/envs/mamba/envs/wgdv2/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/ermoore3/miniconda2/envs/mamba/envs/wgdv2/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/ermoore3/miniconda2/envs/mamba/envs/wgdv2/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ermoore3/miniconda2/envs/mamba/envs/wgdv2/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ermoore3/miniconda2/envs/mamba/envs/wgdv2/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/ermoore3/miniconda2/envs/mamba/envs/wgdv2/lib/python3.6/site-packages/cli.py", line 117, in dmd
    _dmd(**kwargs)
  File "/home/ermoore3/miniconda2/envs/mamba/envs/wgdv2/lib/python3.6/site-packages/cli.py", line 155, in _dmd
    Parallel(n_jobs=nthreads,backend='multiprocessing')(delayed(parallelrbh)(s,i,j,ogformat,cscore,eval) for i,j in pairs)
  File "/home/ermoore3/miniconda2/envs/mamba/envs/wgdv2/lib/python3.6/site-packages/joblib/parallel.py", line 789, in __call__
    self.retrieve()
  File "/home/ermoore3/miniconda2/envs/mamba/envs/wgdv2/lib/python3.6/site-packages/joblib/parallel.py", line 699, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/home/ermoore3/miniconda2/envs/mamba/envs/wgdv2/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
  File "/home/ermoore3/miniconda2/envs/mamba/envs/wgdv2/lib/python3.6/multiprocessing/pool.py", line 424, in _handle_tasks
    put(task)
  File "/home/ermoore3/miniconda2/envs/mamba/envs/wgdv2/lib/python3.6/site-packages/joblib/pool.py", line 372, in send
    self._writer.send_bytes(buffer.getvalue())
  File "/home/ermoore3/miniconda2/envs/mamba/envs/wgdv2/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/home/ermoore3/miniconda2/envs/mamba/envs/wgdv2/lib/python3.6/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647

I believe this error occurs because the input files are too large. I believe this because, when running wgd dmd for a single species (e.g., wgd dmd Sample1.fasta -of), the resulting .tsv files for the failing samples are ~8x larger than those of the samples that ran successfully.

Do you believe I am correct? If so, do you have any suggestions on how to fix this?

Any help is appreciated!

Best,
Erika

Suggestions on the result

Thank you for the convenient tool!

I have successfully performed the analysis, however, I am a bit confused about the result interpretation and final presentation of the result in a standard way.

I followed these commands

  1. wgd dmd --globalmrbh SPECIES_cds Zea_mays_cds Amborella_trichopoda_cds Musa_acuminata_cds --cds -n 90
  2. wgd ksd wgd_dmd/global_MRBH.tsv --extraparanomeks ../wgd_ksd/SPECIES_cds.tsv.ks.tsv -sp speciestree.nw -o wgd_globalmrbh_ks --spair "SPECIES_cds;Musa_acuminata_cds" --spair "SPECIES_cds;Amborella_trichopoda_cds" --spair "SPECIES_cds;Zea_mays_cds" --spair "SPECIES_cds;SPECIES_cds" --reweight --plotkde
  3. wgd viz -d wgd_globalmrbh_ks/global_MRBH.tsv.ks.tsv -sp speciestree.nw --extraparanomeks ../wgd_ksd/SPECIES_cds.tsv.ks.tsv --spair "SPECIES_cds;Musa_acuminata_cds" --spair "SPECIES_cds;Amborella_trichopoda_cds" --spair "SPECIES_cds;Zea_mays_cds" --spair "SPECIES_cds;SPECIES_cds" --reweight --plotkde

the results from 2nd and 3rd are attached

I wanted to know why I am not getting the SPECIES_CDS paranome in the 2nd figure (SPECIES_cds_Corrected.ksd.averaged.pdf)? and can we use this 2nd figure to infer that SPECIES_CDS and Musa_acuminata_cds shared the same WGD event which happened after the divergence of SPECIES_CDS with Zea mays and Amborella?

Thanks
SPECIES_cds_Corrected.ksd.weighted.pdf
SPECIES_cds_Corrected.ksd.averaged.pdf

Issue with the spair flag

Hi,

I'm trying to run wgd ksd with the spair flag:

wgd ksd global_MRBH.tsv *.fa --extraparanomeks SalCuc.fa.tsv.ks.tsv -sp speciestree.nw --reweight -o wgd_globalmrbh_ks_new --spair "SalCuc.fa;SalCuc.fa" --spair "SalCuc.fa;AzoFil.fa" --spair "SalCuc.fa;CerRic.fa" --spair "SalCuc.fa;AdiCap.fa" --spair "SalCuc.fa;AlsSpi.fa" --spair "SalCuc.fa;CibBar.fa" --spair "SalCuc.fa;DicPed.fa" --plotkde

and I get:

Error: Invalid value for '[SEQUENCES]...': Path ' ' does not exist.

The species tree file matches the --spair option. What could be the issue?

The dataset to reproduce

Best,
Evgenii

ERROR: Rate correction using wgd ksd wgd_globalmrbh

First of all Thanks a lot for the tool and the support provided with the previous issue.
I want to do the rate correction using the wgd ksd wgd_globalmrbh command, but I am encountering the following error:
    sys.exit(cli())
  File "/home/genomics7/SOFTWARES/wgd/ENV/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/genomics7/SOFTWARES/wgd/ENV/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/genomics7/SOFTWARES/wgd/ENV/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/genomics7/SOFTWARES/wgd/ENV/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/genomics7/SOFTWARES/wgd/ENV/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/genomics7/SOFTWARES/wgd/ENV/lib/python3.7/site-packages/cli.py", line 451, in ksd
    _ksd(**kwargs)
  File "/home/genomics7/SOFTWARES/wgd/ENV/lib/python3.7/site-packages/cli.py", line 490, in _ksd
    multi_sp_plot(df,spair,spgenemap,outdir,onlyrootout,title=prefix,ylabel=ylabel,ksd=True,reweight=reweight,sptree=speciestree,extraparanomeks=extraparanomeks, ap = anchorpoints,plotkde=plotkde,plotapgmm=plotapgmm,plotelmm=plotelmm,components=components,na=True)
  File "/home/genomics7/SOFTWARES/wgd/ENV/lib/python3.7/site-packages/wgd/viz.py", line 499, in multi_sp_plot
    df_perspair,allspair,paralog_pair,corrected_ks_spair,Outgroup_spnames = getspair_ks(spair,df,spgenemap,reweight,onlyrootout,sptree=sptree)
  File "/home/genomics7/SOFTWARES/wgd/ENV/lib/python3.7/site-packages/wgd/viz.py", line 63, in getspair_ks
    if sptree != None and len(paralog_pair) !=0 : corrected_ks_spair,Outgroup_spnames = correctks(df,sptree,paralog_pair[0],reweight,onlyrootout)
  File "/home/genomics7/SOFTWARES/wgd/ENV/lib/python3.7/site-packages/wgd/viz.py", line 170, in correctks
    ks_spair = getspairks(all_spairs,df,reweight,method='mode')
  File "/home/genomics7/SOFTWARES/wgd/ENV/lib/python3.7/site-packages/wgd/viz.py", line 129, in getspairks
    kde = stats.gaussian_kde(y,weights=w,bw_method=0.1)
  File "/home/genomics7/SOFTWARES/wgd/ENV/lib/python3.7/site-packages/scipy/stats/kde.py", line 193, in __init__
    raise ValueError("dataset input should have multiple elements.")
ValueError: dataset input should have multiple elements.

I have no idea what went wrong! Please help.

Thank you

ksd .tsv out

Hi, thanks for the great tool!

I have a problem with the .tsv file. The tool and command ran successfully and I got the plots. But when I tried the "peak" command, I got an error. When I then looked at my .tsv file, there were no numeric counts:

Screenshot 2024-05-21 09 48 23

Here is what my input file looks like:

Screenshot 2024-05-21 09 48 56

dmd command output:
Screenshot 2024-05-21 09 49 39

Thanks for your time,

İlayda

Wgd_syn error (File not found anchorpoints.txt)

Hi,
Thanks for this commendable tool to detect Whole Genome Duplication
I installed the tool and it gave successful results with wgd ksd for the whole paranome analysis. I then tried to run the wgd syn command to obtain the anchor Ks distribution, but I encountered the following issue:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/HDD1/WGD/wgd_syn/iadhore-out/anchorpoints.txt'

A snippet of the whole error is here
Write statistics = false
Alignment method = GreedyGraphbased4
Multiple hypothesis correction = FDR
Number of threads = 1
Compare aligners = false
Collinear searches only
Visualize GHM.png = false
Visualize Alignment = false
Verbose output = true
************ END i-AdDHoRe parameters *********

              Creating dataset...                                           
     INFO     Processing I-ADHoRe output                          cli.py:652

Traceback (most recent call last):
File "/home/samuelG/.conda/envs/samuel/bin/wgd", line 8, in
sys.exit(cli())
File "/home/samuelG/.conda/envs/samuel/lib/python3.7/site-packages/click/core.py", line 829, in call
return self.main(*args, **kwargs)
File "/home/samuelG/.conda/envs/samuel/lib/python3.7/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/samuelG/.conda/envs/samuel/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/samuelG/.conda/envs/samuel/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/samuelG/.conda/envs/samuel/lib/python3.7/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/samuelG/.conda/envs/samuel/lib/python3.7/site-packages/cli.py", line 613, in syn
_syn(**kwargs)
File "/home/samuelG/.conda/envs/samuel/lib/python3.7/site-packages/cli.py", line 654, in _syn
anchors,orig_anchors = get_anchors(out_path)
File "/home/samuelG/.conda/envs/samuel/lib/python3.7/site-packages/wgd/syn.py", line 181, in get_anchors
else: anchors = pd.read_csv(os.path.join(out_path, "anchorpoints.txt"), sep="\t", index_col=0)
File "/home/samuelG/.conda/envs/samuel/lib/python3.7/site-packages/pandas/util/_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "/home/samuelG/.conda/envs/samuel/lib/python3.7/site-packages/pandas/io/parsers/readers.py", line 586, in read_csv
return _read(filepath_or_buffer, kwds)
File "/home/samuelG/.conda/envs/samuel/lib/python3.7/site-packages/pandas/io/parsers/readers.py", line 482, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/home/samuelG/.conda/envs/samuel/lib/python3.7/site-packages/pandas/io/parsers/readers.py", line 811, in init
self._engine = self._make_engine(self.engine)
File "/home/samuelG/.conda/envs/samuel/lib/python3.7/site-packages/pandas/io/parsers/readers.py", line 1040, in _make_engine
return mapping[engine](self.f, **self.options) # type: ignore[call-arg]
File "/home/samuelG/.conda/envs/samuel/lib/python3.7/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 51, in init
self._open_handles(src, kwds)
File "/home/samuelG/.conda/envs/samuel/lib/python3.7/site-packages/pandas/io/parsers/base_parser.py", line 229, in _open_handles
errors=kwds.get("encoding_errors", "strict"),
File "/home/samuelG/.conda/envs/samuel/lib/python3.7/site-packages/pandas/io/common.py", line 707, in get_handle
newline="",
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/HDD1/WGD/wgd_syn/iadhore-out/anchorpoints.txt'

Differentiation between true wgd and artifact

Hi, Thanks for the tool and its proper documentation!

I successfully ran the analysis for my paralogs and got the attached results from wgd ksd and wgd syn. However, I am confused about which peak to consider. As you can see in the file with the paranome Ks distribution, I am getting two peaks at 0.04 and 0.5. I am also attaching the anchor pair file and the wgd peak result for these anchor pairs. I think the 0.5 Ks value is more reliable than 0.05, but I cannot figure out how I should rule out the peak at 0.05 as a false positive, as it could also be due to transposon activity and subgenome divergence. Please help!
WGD_paralogs.pdf
final_true_cds_WH_single.fasta.tsv.ksd.pdf
AnchorKs_PeakCI_final_true_cds_WH_single.fasta.tsv.ks.tsv_node_weighted.pdf

Thanks
Manohar

Error installing wgd2

Hi,
I am getting issues installing wgd2. I get the following error message.

Getting requirements to build wheel ... error
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [17 lines of output]
Traceback (most recent call last):
File "/home/wgd/ENV/lib/python3.10/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 363, in
main()
File "/home/wgd/ENV/lib/python3.10/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 345, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
File "/home/wgd/ENV/lib/python3.10/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 130, in get_requires_for_build_wheel
return hook(config_settings)
File "/scratch/tmp/pip-build-env-3on5vztu/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 341, in get_requires_for_build_wheel
return self._get_build_requires(config_settings, requirements=['wheel'])
File "/scratch/tmp/pip-build-env-3on5vztu/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 323, in _get_build_requires
self.run_setup()
File "/scratch/tmp/pip-build-env-3on5vztu/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 487, in run_setup
super(_BuildMetaLegacyBackend,
File "/scratch/tmp/pip-build-env-3on5vztu/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 338, in run_setup
exec(code, locals())
File "", line 12, in
ModuleNotFoundError: No module named 'numpy'
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

ksd not finishing

Hello,

I've been trying to use wgd V2 to calculate a Ks peak

I've taken a subset of my CDS (~3K sequences)

When I run

 wgd dmd subset.fasta 

I get:
Screenshot 2024-06-25 at 11 23 09
But it finishes. So if I run:

wgd ksd wgd_dmd/subset.fasta.tsv subset.fasta

It will start off okay and just slow to a stop. I also get this error, for pretty much each gene family:
Screenshot 2024-06-25 at 11 29 45

Can Orthologous Isoforms be identified?

Hi, @heche-psb

Generally, homologous gene pairs are identified in comparative genomes, but here I use the full-length transcriptome to identify homologous isoforms of two species. Is it possible to do this?

Best wishes!

Error running wgd dmd with the globalmrbh flag

Hi,

I have an issue running wgd v2.0.29 with the globalmrbh flag [wgd dmd --globalmrbh -I 2 -e 1e-10 -bs 100 -np 5 --to_stop --cds SalCuc.fa AzoFil.fa -n 40 -o wgd_globalmrbh -t tmp]:

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/user1/mambaforge/envs/WGD/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 350, in call
return self.func(*args, **kwargs)
File "/home/user1/mambaforge/envs/WGD/lib/python3.8/site-packages/joblib/parallel.py", line 131, in call
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "/home/user1/mambaforge/envs/WGD/lib/python3.8/site-packages/joblib/parallel.py", line 131, in
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "/home/user1/mambaforge/envs/WGD/lib/python3.8/site-packages/wgd/core.py", line 822, in get_mrbh
s_i.get_rbh_orthologs(s_j, cscore, False, eval=eval)
File "/home/user1/mambaforge/envs/WGD/lib/python3.8/site-packages/wgd/core.py", line 391, in get_rbh_orthologs
df = self.run_diamond(seqs, orthoinfer, eval=eval)
File "/home/user1/mambaforge/envs/WGD/lib/python3.8/site-packages/wgd/core.py", line 384, in run_diamond
df = normalizebitscore(self.gene_length,df,outpath,sgidmaps=sgidmaps,idmap=self.idmap,seqmap=seqs.idmap,bins = self.bins,hitper = self.np).drop(columns=[11,12]).rename(columns={13:11})
File "/home/user1/mambaforge/envs/WGD/lib/python3.8/site-packages/wgd/core.py", line 169, in normalizebitscore
df.loc[:,15] = df[1].apply(lambda x:combinedidmaps[x])
File "/home/user1/mambaforge/envs/WGD/lib/python3.8/site-packages/pandas/core/series.py", line 4433, in apply
return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
File "/home/user1/mambaforge/envs/WGD/lib/python3.8/site-packages/pandas/core/apply.py", line 1088, in apply
return self.apply_standard()
File "/home/user1/mambaforge/envs/WGD/lib/python3.8/site-packages/pandas/core/apply.py", line 1143, in apply_standard
mapped = lib.map_infer(
File "pandas/_libs/lib.pyx", line 2870, in pandas._libs.lib.map_infer
File "/home/user1/mambaforge/envs/WGD/lib/python3.8/site-packages/wgd/core.py", line 169, in
df.loc[:,15] = df[1].apply(lambda x:combinedidmaps[x])
KeyError: 'SalCuc.fa_04013'

Attachments:

  1. log.txt
  2. The dataset used Data_CDS.zip

Best,
Evgenii

Error running wgd

Hi,
This is a great tool; I have used version 1 and am now working with version 2. I managed to install it with conda, however I am getting the following error

wgd -h
Usage: wgd [OPTIONS] COMMAND [ARGS]...
wgd v2 - Copyright (C) 2023-2024 Hengchi Chen
Contact: [email protected]
Options:
-v, --verbosity [info|debug] Verbosity level, default = info.
-h, --help Show this message and exit.
Commands:
dmd All-vs-all diamond blastp + MCL clustering.
focus Multiply species RBH or c-score defined orthologous family's gene...
ksd Paranome and one-to-one ortholog Ks distribution inference...
mix Mixture modeling of Ks distributions.
peak Infer peak and CI of Ks distribution.
syn Co-linearity and anchor inference using I-ADHoRe.
viz Visualization of Ks distribution or synteny

wgd dmd
09:04:59 INFO This is wgd v1.2 cli.py:32
Traceback (most recent call last):
File "/home/.conda/envs/WGD/bin/wgd", line 10, in
sys.exit(cli())
File "/home/.local/lib/python3.6/site-packages/click/core.py", line 829, in call
return self.main(*args, **kwargs)
File "/home/.local/lib/python3.6/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/.local/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/.local/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/.local/lib/python3.6/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/.conda/envs/WGD/lib/python3.6/site-packages/cli.py", line 113, in dmd
_dmd(**kwargs)
File "/home/.conda/envs/WGD/lib/python3.6/site-packages/cli.py", line 116, in _dmd
from wgd.core import SequenceData, read_MultiRBH_gene_families,mrbh,ortho_infer,genes2fams,endt,segmentsaps,bsog
ModuleNotFoundError: No module named 'wgd.core'

wgd viz
09:05:19 INFO This is wgd v1.2 cli.py:32
Traceback (most recent call last):
File "/home/.conda/envs/WGD/bin/wgd", line 10, in
sys.exit(cli())
File "/home/.local/lib/python3.6/site-packages/click/core.py", line 829, in call
return self.main(*args, **kwargs)
File "/home/.local/lib/python3.6/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/.local/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/.local/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/.local/lib/python3.6/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/.conda/envs/WGD/lib/python3.6/site-packages/cli.py", line 533, in viz
_viz(**kwargs)
File "/home/.conda/envs/WGD/lib/python3.6/site-packages/cli.py", line 536, in _viz
from wgd.viz import elmm_plot, apply_filters, multi_sp_plot, default_plot,all_dotplots,filter_by_minlength,dotplotunitgene,dotplotingene,filter_mingenumber
ImportError: cannot import name 'elmm_plot'

Any help would be great.
Thanks

error in Ks correction analysis

Hello, thank you again for this really useful tool. I was successfully able to run several commands:

wgd dmd --globalmrbh *.fasta -o wgd_globalmrbh

All my sequence files are the longest isoform of the genes.
This successfully generated the family file.

I ran wgd ksd wgd_globalmrbh/global_MRBH.tsv *.fasta -n 16 -o wgd_globalmrbh_ks
and this ran through- and I got auto-generated Ks plots for the paranome as well.

When I run:

wgd viz -d wgd_globalmrbh_ks/global_MRBH.tsv.ks.tsv --extraparanomeks wgd_ksd/pafricana.chr_longest_isoform.cds.fasta.tsv.ks.tsv -sp speciestree.txt -o wgd_viz_mixed_Ks_1 --plotkde  --plotelmm --plotapgmm -o wgd_viz_try2 --spair "pafricana.chr_longest_isoform.cds.fasta;Fhyg__genBlastG_sorted_CDS.fasta" --spair "Ppatens_longest_isoform.cds.fasta;pafricana.chr_longest_isoform.cds.fasta" --spair "pafricana.chr_longest_isoform.cds.fasta;pafricana.chr_longest_isoform.cds.fasta"

I run into an error:

Traceback (most recent call last):
  File "/home/FCAM/vvuruputoor/wgd/wgd_venv/bin/wgd", line 33, in <module>
    sys.exit(load_entry_point('wgd', 'console_scripts', 'wgd')())
  File "/home/FCAM/vvuruputoor/wgd/wgd_venv/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/FCAM/vvuruputoor/wgd/wgd_venv/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/FCAM/vvuruputoor/wgd/wgd_venv/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/FCAM/vvuruputoor/wgd/wgd_venv/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/FCAM/vvuruputoor/wgd/wgd_venv/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/FCAM/vvuruputoor/wgd/cli.py", line 588, in viz
    _viz(**kwargs)
  File "/home/FCAM/vvuruputoor/wgd/cli.py", line 629, in _viz
    multi_sp_plot(df,spair,gsmap,outdir,onlyrootout,title=prefix,ylabel=ylabel,viz=True,plotkde=plotkde,reweight=False,sptree=speciestree,ap = anchorpoints, extraparanomeks=extraparanomeks,plotapgmm=plotapgmm,plotelmm=plotelmm,components=components,max_EM_iterations=em_iterations,num_EM_initializations=em_initializations,peak_threshold=prominence_cutoff,rel_height=rel_height, na=nodeaveraged,user_xlim=xlim,user_ylim=ylim,adjustortho=adjustortho,adfactor=adjustfactor,okalpha=okalpha,focus2all=focus2all,clean=classic)
  File "/home/FCAM/vvuruputoor/wgd/wgd/viz.py", line 625, in multi_sp_plot
    ratediffplot(df,outdir,focus2all,sptree,onlyrootout,reweight,extraparanomeks,ap,na=na,elmm=plotelmm,mEM=max_EM_iterations,nEM=num_EM_initializations,pt=peak_threshold,rh=rel_height,components=components,apgmm=plotapgmm)
  File "/home/FCAM/vvuruputoor/wgd/wgd/ratecorrect.py", line 784, in ratediffplot
    getspairplot_cov_cor(df,focusp,speciestree,onlyrootout,reweight,extraparanomeks,anchorpoints,outdir,na=na,elmm=elmm,mEM=mEM,nEM=nEM,pt=pt,rh=rh,components=components,apgmm=apgmm)
  File "/home/FCAM/vvuruputoor/wgd/wgd/ratecorrect.py", line 745, in getspairplot_cov_cor
    else: all_spairs,spairs,Trios,Trios_dict = gettrios_overall(focusp,Ingroup_spnames,Outgroup_spnames,Ingroup_clade)
  File "/home/FCAM/vvuruputoor/wgd/wgd/ratecorrect.py", line 648, in gettrios_overall
    sppair = "{}".format("__".join(sorted([sister,focusp])))
TypeError: '<' not supported between instances of 'NoneType' and 'str'

The species names in the species tree match the FASTA file names, and I double-checked the global_MRBH.tsv.ks.tsv and pafricana.chr_longest_isoform.cds.fasta.tsv.ks.tsv files. There were no errors in those steps (I checked after reading the recommendations in the other issues).

I also created my own gene_species.map, because it was not generated at the wgd ksd step, but I still run into the same error.
Please let me know what I should change. Thank you!
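
A note on the traceback above: it ends in `sorted([sister, focusp])` with one element being `None`, and Python raises exactly this TypeError when `sorted` has to compare `None` with a string. One possible cause (an assumption, not something confirmed in this thread) is that a name passed to `--spair` has no matching leaf in the species tree. The sketch below is a standalone diagnostic and not part of wgd; `newick_leaf_names` and `names_missing_from_tree` are hypothetical helpers, the tree and pairs are made up, and the regex only handles simple Newick strings without quoted labels.

```python
import re

def newick_leaf_names(newick):
    """Return the set of leaf labels in a simple Newick string (no quoted labels)."""
    # A leaf label directly follows "(" or "," and stops at ",", ")", ":", ";" or whitespace.
    return set(re.findall(r"[(,]\s*([^(),:;\s]+)", newick))

def names_missing_from_tree(spairs, newick):
    """List species named in "a;b" pair strings that are not leaves of the tree."""
    leaves = newick_leaf_names(newick)
    wanted = {name for pair in spairs for name in pair.split(";")}
    return sorted(wanted - leaves)

# The failure mode itself: sorting None together with a string raises TypeError.
try:
    sorted([None, "pafricana.chr_longest_isoform.cds.fasta"])
except TypeError as err:
    print("TypeError:", err)

# Hypothetical tree and pairs; the tree deliberately lacks the species "C.fasta".
tree = "((A.fasta:0.1,B.fasta:0.2):0.3,D.fasta:0.4);"
print(names_missing_from_tree(["A.fasta;B.fasta", "A.fasta;C.fasta"], tree))
```

Any name reported by the second call is worth comparing character by character against the tree file, since path prefixes or extensions are easy to get out of sync.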

weird results using test data in WGD ksd analysis

Hello, I'm trying to analyze WGD event using your software.

I installed the software with pip in a Conda environment with Python 3.8 and NumPy 1.19.0.
Using ugi1000.fasta from the test/data directory, I ran wgd dmd and wgd ksd.
However, the ks.tsv file I got has a different format from ath.ks.tsv.

[Screenshot (53)]

What am I doing wrong?

ERROR All families are singleton families, No Ks can be calculated - RBH

Hello!

I've been using wgd dmd for the whole paranome and then wgd ksd to get the Ks distribution of a single species, and everything works great. But when I try to run dmd and ksd for RBHs, I get the error below no matter which samples I use.

ERROR All families are singleton families, No Ks can be calculated

I used all default parameters, as in the examples. Is there any parameter that I'm missing? I'm using wgd v2.0.30.

wgd dmd sequence1 sequence2
wgd ksd families sequence1 sequence2

Thanks for your help!
Best,
Diego

Segmentation fault (core dumped)

Hi,

I am currently testing the wgd v2 software with the following command:

wgd dmd sample.longestcds.fa -o 01.step-dmd -n 10
However, I encountered an error message:

12:31:22 INFO     This is wgd v2.0.23                                                                                         cli.py:32     
Segmentation fault (core dumped)

Could you please advise on how to resolve this issue?

`wgd viz` throws error when including self-comparison as --speciespair

Hi again! I found a few bugs which are not absolutely prohibitive but would be good to fix at some point. I think I will post them here one by one.

I run substitution rate correction with wgd viz in wgd 2.0.22 as follows:

wgd viz -d wgd_ksd/global_MRBH.tsv.ks.tsv \
    -epk species1_ksd/species1.fa.tsv.ks.tsv -sp speciestree.txt -rw -ap species1_syn/iadhore-out/anchorpoints.txt  \
    -sr "species1.fa;species2.fa" -sr "species1.fa;species3.fa" -sr "species1.fa;species4.fa" -sr "species1.fa;species1.fa" \
    -gs wgd_ksd/gene_species.map --plotkde --plotelmm

and get the following StopIteration output:


14:15:25 INFO     This is wgd v2.0.22                                                                                                               cli.py:32
14:15:28 INFO     Implementing node-averaged Ks analysis                                                                                           viz.py:511
Traceback (most recent call last):
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd_2.0.22/bin/wgd", line 11, in <module>
    load_entry_point('wgd', 'console_scripts', 'wgd')()
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd_2.0.22/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd_2.0.22/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd_2.0.22/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd_2.0.22/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd_2.0.22/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/cli.py", line 557, in viz
    _viz(**kwargs)
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/cli.py", line 597, in _viz
    multi_sp_plot(df,spair,gsmap,outdir,onlyrootout,title=prefix,ylabel=ylabel,viz=True,plotkde=plotkde,reweight=False,sptree=speciestree,ap = anchorpoints, extraparanomeks=extraparanomeks,plotapgmm=plotapgmm,plotelmm=plotelmm,components=components,max_EM_iterations=em_iterations,num_EM_initializations=em_initializations,peak_threshold=prominence_cutoff,rel_height=rel_height, na=True,user_xlim=xlim,user_ylim=ylim)
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd/viz.py", line 527, in multi_sp_plot
    df_perspair,allspair,paralog_pair,corrected_ks_spair,Outgroup_spnames = getspair_ks(spair,df,reweight,onlyrootout,sptree=sptree,na=na,spgenemap=spgenemap)
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd/viz.py", line 72, in getspair_ks
    if sptree != None and len(paralog_pair) !=0 : corrected_ks_spair,Outgroup_spnames = correctks(df,sptree,paralog_pair[0],reweight,onlyrootout,na=na)
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd/viz.py", line 180, in correctks
    focussp_clade = next(tree.find_clades({"name": focusp}))
StopIteration

Removing the -sr "species1.fa;species1.fa" portion of the command results in a successful run. Can you reproduce this?
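
For readers following the traceback: the last frame calls `next(...)` on the result of a tree search, and `next` raises StopIteration whenever that iterator yields nothing. Why the lookup comes back empty for the self-comparison is not established in this thread. The sketch below only reproduces the Python behaviour with a plain generator; `find_by_name` is a hypothetical stand-in and not the function wgd or Biopython uses.

```python
def find_by_name(names, target):
    """Return an iterator over the names equal to target (may be empty)."""
    return (n for n in names if n == target)

leaves = ["species2.fa", "species3.fa", "species4.fa"]

# next() on an empty iterator raises StopIteration, as in the traceback.
try:
    next(find_by_name(leaves, "species1.fa"))
    outcome = "found"
except StopIteration:
    outcome = "StopIteration"
print(outcome)

# Supplying a default turns "no match" into a value that can be checked explicitly.
match = next(find_by_name(leaves, "species1.fa"), None)
print(match is None)
```

Passing a default to `next` is one common way to replace the bare StopIteration with an explicit, readable error message.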

viz running problem

Hello, I quickly ran into a problem when using the viz module of wgd v2.0.25. According to the documentation on the official website, there should be a -fa option after v 2.2.24, but this option does not appear when I run wgd viz -h, and running it directly gives an error: no such option: -f

syn -gff3 file

Hi, I'm writing about another step. I finished the others successfully, so thank you for your help!

A small question about the syn step. I have a gff3 file from Augustus, but when I run the command:

wgd syn /wgd_dmd/sample.cds.fasta.tsv /Augustus/sample.gff

I get an error. The gff3 file looks like this:
[Screenshot 2024-05-28 12 56 50]
The error looks like this:

[Screenshot 2024-05-28 12 54 05]

I don't understand it, because the gff file and wgd use the same cds.fasta input. Do you have any suggestions for getting the right gff3 file?

Many thanks for your time and help!

İlayda

A tutorial to customization of plots in Python?

Hi Hengchi!

While getting to use the pipeline more, I noticed that I am often tempted to tweak many aspects of the plots output at every stage. To name a few examples:

  • Change X axis limits in Ks distribution plots (there are options for that in wgd ksd and wgd viz but they do not seem to work)
  • Modify the order of chromosomes in the synteny dotplots, change the range of Ks values to be colored
  • Simply change the device size, e.g. to achieve 1:1 aspect ratio for self-synteny dotplots

I think it might be an overkill to have all of the plotting options as the wgd command arguments, but is it maybe not so difficult to recreate the plots with some self-tinkered options using Python? So far I had to reinvent some of the wheels myself.

Maybe it would be sufficient (and actually very cool) to have some of the simple tweaks of the main plot types (any of those mentioned above would work I guess) in a tutorial that would show how to use the Python functions of the pipeline given the pre-generated results.

Cheers,
Nikita
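
As a starting point for the kind of tinkering described above, here is a minimal sketch that re-bins Ks values from a pre-generated table over a chosen range, so that the X axis limits are under your control. It is not wgd's own plotting code. The column name `dS` and the tab-separated layout are assumptions about the ks.tsv file, and `read_ks` and `histogram` are hypothetical helpers. The resulting counts can be drawn with any plotting library.

```python
import csv
import io

def read_ks(handle, column="dS"):
    """Collect Ks values from a tab-separated table with a header row."""
    values = []
    for row in csv.DictReader(handle, delimiter="\t"):
        try:
            values.append(float(row[column]))
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric value
    return values

def histogram(values, lo=0.0, hi=5.0, nbins=50):
    """Count values falling in [lo, hi) using nbins equal-width bins."""
    width = (hi - lo) / nbins
    counts = [0] * nbins
    for v in values:
        if lo <= v < hi:
            counts[min(int((v - lo) / width), nbins - 1)] += 1
    return counts

# Made-up table standing in for a real ks.tsv file.
table = io.StringIO("pair\tdS\ng1__g2\t0.1\ng3__g4\t0.6\ng5__g6\t0.7\ng7__g8\t3.0\ng9__g10\tNA\n")
ks = read_ks(table)
print(histogram(ks, lo=0.0, hi=1.0, nbins=2))
```

With a real file, `open("your.ks.tsv")` would replace the `io.StringIO` object, and `lo` and `hi` play the role of the X axis limits.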

wgd-1.1.1/wgd/codeml.py:131: SettingWithCopyWarning:

Hi,
I got this error when I ran python3 wgd-1.1.1/wgd_cli.py ksd wgd_dmd/newCDS.fa.mcl newCDS.fa
How can I fix this?

/var/lib/condor/execute/slot1/dir_38140/wgd-1.1.1/wgd/codeml.py:131: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
results_dict['Ks'][gene_2][gene_1] = ks_value
/var/lib/condor/execute/slot1/dir_38140/wgd-1.1.1/wgd/codeml.py:133: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
results_dict['Ka'][gene_2][gene_1] = ka_value
/var/lib/condor/execute/slot1/dir_38140/wgd-1.1.1/wgd/codeml.py:135: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
results_dict['Omega'][gene_2][gene_1] = w
/var/lib/condor/execute/slot1/dir_38140/wgd-1.1.1/wgd/codeml.py:130: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
results_dict['Ks'][gene_1][gene_2] = ks_value
/var/lib/condor/execute/slot1/dir_38140/wgd-1.1.1/wgd/codeml.py:132: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
results_dict['Ka'][gene_1][gene_2] = ka_value
/var/lib/condor/execute/slot1/dir_38140/wgd-1.1.1/wgd/codeml.py:134: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
results_dict['Omega'][gene_1][gene_2] = w
2024-01-05 16:26:57: INFO Performing analysis on gene family GF_000021
/var/lib/condor/execute/slot1/dir_38140/wgd-1.1.1/wgd/codeml.py:131: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
results_dict['Ks'][gene_2][gene_1] = ks_value
/var/lib/condor/execute/slot1/dir_38140/wgd-1.1.1/wgd/codeml.py:133: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
results_dict['Ka'][gene_2][gene_1] = ka_value
/var/lib/condor/execute/slot1/dir_38140/wgd-1.1.1/wgd/codeml.py:135: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
results_dict['Omega'][gene_2][gene_1] = w
/var/lib/condor/execute/slot1/dir_38140/wgd-1.1.1/wgd/codeml.py:130: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
results_dict['Ks'][gene_1][gene_2] = ks_value
/var/lib/condor/execute/slot1/dir_38140/wgd-1.1.1/wgd/codeml.py:132: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
results_dict['Ka'][gene_1][gene_2] = ka_value
/var/lib/condor/execute/slot1/dir_38140/wgd-1.1.1/wgd/codeml.py:134: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
results_dict['Omega'][gene_1][gene_2] = w
2024-01-05 16:28:06: INFO Performing analysis on gene family GF_000022
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/joblib/_parallel_backends.py", line 350, in call
return self.func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/joblib/parallel.py", line 131, in call
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "/usr/local/lib/python3.8/dist-packages/joblib/parallel.py", line 131, in
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "/var/lib/condor/execute/slot1/dir_38140/wgd-1.1.1/wgd/ks_distribution.py", line 307, in analyse_family
out = _calculate_weighted_ks(
File "/var/lib/condor/execute/slot1/dir_38140/wgd-1.1.1/wgd/ks_distribution.py", line 197, in _calculate_weighted_ks
if pairwise_estimates['Ks'].iloc[i, j] > 5:
TypeError: '>' not supported between instances of 'str' and 'int'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/usr/local/lib/python3.8/dist-packages/joblib/_parallel_backends.py", line 359, in call
raise TransportableException(text, e_type)
joblib.my_exceptions.TransportableException: TransportableException


TypeError Fri Jan 5 16:31:01 2024
PID: 4726 Python 3.8.10: /usr/bin/python3
...........................................................................
/usr/local/lib/python3.8/dist-packages/joblib/parallel.py in call(self=<joblib.parallel.BatchedCalls object>)
126 def init(self, iterator_slice):
127 self.items = list(iterator_slice)
128 self._size = len(self.items)
129
130 def call(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
self.items = [(, ('GF_000007', {'ptg000001l.2310': 'MSPLSTLLLIISLALISTFVAADPDSLQDICVADYTTGIKVNGYPCKE...AFNSQLPGTQSLATTLFGASPQVPDNVLTKAFKIGTKEVDKIKSRFVAK', 'ptg000001l.2311': 'MASIATLLLLSFALFSTSIFADPDSLQDICVADLTSGTKLNGFPCKAN...AFNSQLPGTQSLGTTLFAATPQVPDNVLSKAFKISTKEVEIIKYKFAAK', 'ptg000001l.2312': 'MASLATLVLISFALFSTSFATDPDSLQDICVADLSGVKLNGFPCKETA...SAFNSQLPGTQSIATTLFGASPEVPDNVLAKAFKIDTKTVDQIKSSFAA', 'ptg000001l.2313': 'MASFATILLLSFALFSTSFATDADSLQDICVADLASGVKLNGYPCKET...AFNSQLPGTQSIATTLFGASPQVPSNVLSKAFKISQAEVDIIKFKFLVK', 'ptg000003l.1107': 'MIFPIFIFSLLLSSSYALTQDFCVGDLSLPDAPCGFPCKKVAKVNEND...VSFSSHNPGLQILDFALFANDLPSELVEKTTFLDDVQVKKLKKVLGGTG', 'ptg000003l.1379': 'MGSFYNLLASFFLLAFAFSPLANASCNGPLQDFCVAIDEPNNASYVNG...GFNHEFPGISRHGNSLFDAKPSINYKILMRGLKLDKATEELAEGIPSGA', 'ptg000003l.1695': 'MEKRNNSNPAANATCDPGPMQDFCVGINRTYKGAFVNGEFCKNPKEVT...QNPGLLLIPNSVFQTYPPINTSILARTYKIPVKLAMRIQRSFQAEPYKG', 'ptg000003l.1887': 'MESKALACLLIFAAIYNAFAYDPDSLQDLCVANTSSSIKVNGFVCKAE...NGQLPGTQSIALTLFTATPEVPDNVLTKAFQIGTKEVDKIKSKLAPKKT', 'ptg000003l.2099': 'MASTFLKYTILSIIVAILTSRMIQASDPNILSDFLPQNTTSPDASFFT...SCFGSANAGTVSVPTTVFGTGIDAGILAKAFKTDLSTIQKIKAGLAAKA', 'ptg000003l.2462': 'MYSFSRPLALDPSYPLTFQLLFHHPPVNVFLLTPSSSSSSSSSKFSNY...NSQLPGTQSIGLSLFAATPEVPDNVLSRAFQMGTKQIDKLKTKFAPKKT', ...}, {'ptg000001l.100': 'ATGGTGTCTAGCATCGATGAAAACGAAGTTTACGCTGACTTCGCTAGA...CGTAGATTCACATTCTTGCAAGATGTATATTTTGAAGCCAATTTCTTGA', 'ptg000001l.1000': 'ATGATTGGTCCCTCTTTACAAGCTCTTGTGAAGCAACAACATGGTGTC...ATTAAGAAAAGCGCCAGCTCATAGTTTAGTCGAAGTCAGAACTTCGTGA', 'ptg000001l.1001': 'ATGCGGCTTCAGTTGTCGCCTAGTATGAGAAGCATAACGATATCGAGC...GTCGAAGCACAAAAGAGGAAGGCCGGGGCAGCTCTTCTCTAAAGGCTAG', 'ptg000001l.1002': 'ATGGCAGGCGTTCAAGACCAGTTAGAGATTAAGTTTAGGTTGACTGAC...TCCTTTGGAAAAAGGTACAAACTTTTATCTGATCATTCACATTGAATAA', 'ptg000001l.1003': 'ATGGCGACATCATCGTTTAGTGGCCTGAATTCTCCTCTATACCATTCT...AGGTGCCTTGGATGAAAAAAACAAGGGGGCCTTTTCAGGGGCCGTGTAA', 'ptg000001l.1004': 
'ATGGCTGCAAATACTTTTATGTCGTATGCTGTGGATAATAAATCTGAT...TGGATTAAACTTTGATGCAGAAGAAGAAGGCTCTAATCAAAATACATGA', 'ptg000001l.1005': 'ATGGCTGAGGACGGTATTGGTTTGCCAGCGGCTTCAGAAAAAAATAAG...AACAGTTATCTCCCTCCAAGAAAAAAGGCATCCTTCCGTGAACACTTGA', 'ptg000001l.1006': 'ATGGTGGCAACCATTGCTGTGTCTTGTTCGGAATTCATAAATTTTCAA...AGTTTCGGTTGGAGCCGTGAAAAAGACGGTCTCAAGCAAGAAACGGTAG', 'ptg000001l.1007': 'ATGTGGAATTTTGCATCTAATTGCATAGCTGGAAATATTGGATCAAAA...TTTTGCATATCCTAGTTTGGATTGGCTTGTCCGAGAAATCGTCCCTTAG', 'ptg000001l.1008': 'ATGTTTCCTATTTTACTCTTCCTCCTCACAGTTACTATTTCCACCACC...AAGGCTGCAAGTTGGGGCCTCGGATGATGAAAGTTCATCTTCATCCTAG', ...}, '/var/lib/condor/execute/slot1/dir_38140/ks_tmp.3c8e141cb9cee6', 'codeml', False, 1, 100, 'fasttree', 'mafft', '/var/lib/condor/execute/slot1/dir_38140/wgd_ksd'), {})]
132
133 def len(self):
134 return self._size
135

...........................................................................
/usr/local/lib/python3.8/dist-packages/joblib/parallel.py in (.0=<list_iterator object>)
TypeError: '>' not supported between instances of 'str' and 'int'


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "wgd-1.1.1/wgd_cli.py", line 1447, in
cli()
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 829, in call
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1259, in invoke
return process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "wgd-1.1.1/wgd_cli.py", line 625, in ksd
ksd
(
File "wgd-1.1.1/wgd_cli.py", line 763, in ksd_
results = ks_analysis_paranome(
File "/var/lib/condor/execute/slot1/dir_38140/wgd-1.1.1/wgd/ks_distribution.py", line 641, in ks_analysis_paranome
Parallel(n_jobs=n_threads)(delayed(analysis_function)(
File "/usr/local/lib/python3.8/dist-packages/joblib/parallel.py", line 789, in call
self.retrieve()
File "/usr/local/lib/python3.8/dist-packages/joblib/parallel.py", line 740, in retrieve
raise exception
joblib.my_exceptions.JoblibTypeError: JoblibTypeError/var/lib/condor/execute/slot1/dir_38140/wgd-1.1.1/wgd/ks_distribution.py in analyse_family(family_id='GF_000007', family={'ptg000001l.2310': 'MSPLSTLLLIISLALISTFVAADPDSLQDICVADYTTGIKVNGYPCKE...AFNSQLPGTQSLATTLFGASPQVPDNVLTKAFKIGTKEVDKIKSRFVAK', 'ptg000001l.2311': 'MASIATLLLLSFALFSTSIFADPDSLQDICVADLTSGTKLNGFPCKAN...AFNSQLPGTQSLGTTLFAATPQVPDNVLSKAFKISTKEVEIIKYKFAAK', 'ptg000001l.2312': 'MASLATLVLISFALFSTSFATDPDSLQDICVADLSGVKLNGFPCKETA...SAFNSQLPGTQSIATTLFGASPEVPDNVLAKAFKIDTKTVDQIKSSFAA', 'ptg000001l.2313': 'MASFATILLLSFALFSTSFATDADSLQDICVADLASGVKLNGYPCKET...AFNSQLPGTQSIATTLFGASPQVPSNVLSKAFKISQAEVDIIKFKFLVK', 'ptg000003l.1107': 'MIFPIFIFSLLLSSSYALTQDFCVGDLSLPDAPCGFPCKKVAKVNEND...VSFSSHNPGLQILDFALFANDLPSELVEKTTFLDDVQVKKLKKVLGGTG', 'ptg000003l.1379': 'MGSFYNLLASFFLLAFAFSPLANASCNGPLQDFCVAIDEPNNASYVNG...GFNHEFPGISRHGNSLFDAKPSINYKILMRGLKLDKATEELAEGIPSGA', 'ptg000003l.1695': 'MEKRNNSNPAANATCDPGPMQDFCVGINRTYKGAFVNGEFCKNPKEVT...QNPGLLLIPNSVFQTYPPINTSILARTYKIPVKLAMRIQRSFQAEPYKG', 'ptg000003l.1887': 'MESKALACLLIFAAIYNAFAYDPDSLQDLCVANTSSSIKVNGFVCKAE...NGQLPGTQSIALTLFTATPEVPDNVLTKAFQIGTKEVDKIKSKLAPKKT', 'ptg000003l.2099': 'MASTFLKYTILSIIVAILTSRMIQASDPNILSDFLPQNTTSPDASFFT...SCFGSANAGTVSVPTTVFGTGIDAGILAKAFKTDLSTIQKIKAGLAAKA', 'ptg000003l.2462': 'MYSFSRPLALDPSYPLTFQLLFHHPPVNVFLLTPSSSSSSSSSKFSNY...NSQLPGTQSIGLSLFAATPEVPDNVLSRAFQMGTKQIDKLKTKFAPKKT', ...}, nucleotide={'ptg000001l.100': 'ATGGTGTCTAGCATCGATGAAAACGAAGTTTACGCTGACTTCGCTAGA...CGTAGATTCACATTCTTGCAAGATGTATATTTTGAAGCCAATTTCTTGA', 'ptg000001l.1000': 'ATGATTGGTCCCTCTTTACAAGCTCTTGTGAAGCAACAACATGGTGTC...ATTAAGAAAAGCGCCAGCTCATAGTTTAGTCGAAGTCAGAACTTCGTGA', 'ptg000001l.1001': 'ATGCGGCTTCAGTTGTCGCCTAGTATGAGAAGCATAACGATATCGAGC...GTCGAAGCACAAAAGAGGAAGGCCGGGGCAGCTCTTCTCTAAAGGCTAG', 'ptg000001l.1002': 'ATGGCAGGCGTTCAAGACCAGTTAGAGATTAAGTTTAGGTTGACTGAC...TCCTTTGGAAAAAGGTACAAACTTTTATCTGATCATTCACATTGAATAA', 'ptg000001l.1003': 
'ATGGCGACATCATCGTTTAGTGGCCTGAATTCTCCTCTATACCATTCT...AGGTGCCTTGGATGAAAAAAACAAGGGGGCCTTTTCAGGGGCCGTGTAA', 'ptg000001l.1004': 'ATGGCTGCAAATACTTTTATGTCGTATGCTGTGGATAATAAATCTGAT...TGGATTAAACTTTGATGCAGAAGAAGAAGGCTCTAATCAAAATACATGA', 'ptg000001l.1005': 'ATGGCTGAGGACGGTATTGGTTTGCCAGCGGCTTCAGAAAAAAATAAG...AACAGTTATCTCCCTCCAAGAAAAAAGGCATCCTTCCGTGAACACTTGA', 'ptg000001l.1006': 'ATGGTGGCAACCATTGCTGTGTCTTGTTCGGAATTCATAAATTTTCAA...AGTTTCGGTTGGAGCCGTGAAAAAGACGGTCTCAAGCAAGAAACGGTAG', 'ptg000001l.1007': 'ATGTGGAATTTTGCATCTAATTGCATAGCTGGAAATATTGGATCAAAA...TTTTGCATATCCTAGTTTGGATTGGCTTGTCCGAGAAATCGTCCCTTAG', 'ptg000001l.1008': 'ATGTTTCCTATTTTACTCTTCCTCCTCACAGTTACTATTTCCACCACC...AAGGCTGCAAGTTGGGGCCTCGGATGATGAAAGTTCATCTTCATCCTAG', ...}, tmp='/var/lib/condor/execute/slot1/dir_38140/ks_tmp.3c8e141cb9cee6', codeml=<wgd.codeml.Codeml object>, preserve=False, times=1, min_length=100, method='fasttree', aligner='mafft', output_dir='/var/lib/condor/execute/slot1/dir_38140/wgd_ksd')
302 results_dict, msa=msa_path_protein, method="alc")
303 else:
304 clustering, pairwise_distances, tree_path = _weighting(
305 results_dict, msa=msa_path_protein, method=method)
306 if clustering is not None:
--> 307 out = _calculate_weighted_ks(
out = undefined
clustering = array([[2.00000000e+01, 1.80000000e+01, 5.845136...1.23000000e+02, 1.64341494e+00, 6.30000000e+01]])
results_dict = {'Ka': ptg000001l.2310 ptg000001l.2311 ... 0.0698 0.0

[63 rows x 63 columns], 'Ks': ptg000001l.2310 ptg000001l.2311 ... 0.7065 0.0

[63 rows x 63 columns], 'Omega': ptg000001l.2310 ptg000001l.2311 ... 0.0988 0.0

[63 rows x 63 columns]}
pairwise_distances = {0: {0: 0.0, 1: 0.158656603, 2: 0.19943205, 3: 0.22707782199999998, 4: 1.17499233, 5: 1.3474487700000002, 6: 1.342737019, 7: 0.370011188, 8: 1.2996123169999998, 9: 0.39746069300000003, ...}, 1: {0: 0.158656603, 1: 0.0, 2: 0.144857715, 3: 0.172503487, 4: 1.168688817, 5: 1.3411452570000002, 6: 1.336433506, 7: 0.363707675, 8: 1.2933088039999998, 9: 0.39115718000000005, ...}, 2: {0: 0.19943205, 1: 0.144857715, 2: 0.0, 3: 0.148519106, 4: 1.209464264, 5: 1.381920704, 6: 1.377208953, 7: 0.40448312200000003, 8: 1.334084251, 9: 0.431932627, ...}, 3: {0: 0.22707782199999998, 1: 0.172503487, 2: 0.148519106, 3: 0.0, 4: 1.237110036, 5: 1.4095664760000002, 6: 1.404854725, 7: 0.432128894, 8: 1.3617300229999998, 9: 0.459578399, ...}, 4: {0: 1.17499233, 1: 1.168688817, 2: 1.2094642640000002, 3: 1.2371100359999998, 4: 0.0, 5: 1.7149128360000003, 6: 1.7102010850000002, 7: 1.179148544, 8: 1.667076383, 9: 1.2065980490000001, ...}, 5: {0: 1.34744877, 1: 1.3411452570000002, 2: 1.3819207039999999, 3: 1.409566476, 4: 1.714912836, 5: 0.0, 6: 1.066055209, 7: 1.351604984, 8: 1.6259631669999999, 9: 1.3790544889999998, ...}, 6: {0: 1.342737019, 1: 1.3364335059999999, 2: 1.377208953, 3: 1.404854725, 4: 1.710201085, 5: 1.066055209, 6: 0.0, 7: 1.346893233, 8: 1.6212514159999998, 9: 1.374342738, ...}, 7: {0: 0.370011188, 1: 0.363707675, 2: 0.40448312200000003, 3: 0.432128894, 4: 1.179148544, 5: 1.3516049840000002, 6: 1.346893233, 7: 0.0, 8: 1.3037685309999998, 9: 0.266968891, ...}, 8: {0: 1.2996123170000002, 1: 1.2933088040000003, 2: 1.3340842510000002, 3: 1.3617300230000002, 4: 1.6670763830000002, 5: 1.6259631670000003, 6: 1.6212514160000002, 7: 1.3037685310000002, 8: 0.0, 9: 1.331218036, ...}, 9: {0: 0.3974606930000001, 1: 0.39115718, 2: 0.431932627, 3: 0.4595783990000001, 4: 1.2065980490000001, 5: 1.3790544890000003, 6: 1.3743427380000002, 7: 0.26696889100000004, 8: 1.3312180359999999, 9: 0.0, ...}, ...}
family_id = 'GF_000007'
308 clustering, results_dict, pairwise_distances, family_id
309 )
310 out = add_alignment_stats_(out, stats)
311 logging.debug(out)

...........................................................................
/var/lib/condor/execute/slot1/dir_38140/wgd-1.1.1/wgd/ks_distribution.py in _calculate_weighted_ks(clustering=array([[2.00000000e+01, 1.80000000e+01, 5.845136...1.23000000e+02, 1.64341494e+00, 6.30000000e+01]]), pairwise_estimates={'Ka': ptg000001l.2310 ptg000001l.2311 ... 0.0698 0.0

[63 rows x 63 columns], 'Ks': ptg000001l.2310 ptg000001l.2311 ... 0.7065 0.0

[63 rows x 63 columns], 'Omega': ptg000001l.2310 ptg000001l.2311 ... 0.0988 0.0

[63 rows x 63 columns]}, pairwise_distances={0: {0: 0.0, 1: 0.158656603, 2: 0.19943205, 3: 0.22707782199999998, 4: 1.17499233, 5: 1.3474487700000002, 6: 1.342737019, 7: 0.370011188, 8: 1.2996123169999998, 9: 0.39746069300000003, ...}, 1: {0: 0.158656603, 1: 0.0, 2: 0.144857715, 3: 0.172503487, 4: 1.168688817, 5: 1.3411452570000002, 6: 1.336433506, 7: 0.363707675, 8: 1.2933088039999998, 9: 0.39115718000000005, ...}, 2: {0: 0.19943205, 1: 0.144857715, 2: 0.0, 3: 0.148519106, 4: 1.209464264, 5: 1.381920704, 6: 1.377208953, 7: 0.40448312200000003, 8: 1.334084251, 9: 0.431932627, ...}, 3: {0: 0.22707782199999998, 1: 0.172503487, 2: 0.148519106, 3: 0.0, 4: 1.237110036, 5: 1.4095664760000002, 6: 1.404854725, 7: 0.432128894, 8: 1.3617300229999998, 9: 0.459578399, ...}, 4: {0: 1.17499233, 1: 1.168688817, 2: 1.2094642640000002, 3: 1.2371100359999998, 4: 0.0, 5: 1.7149128360000003, 6: 1.7102010850000002, 7: 1.179148544, 8: 1.667076383, 9: 1.2065980490000001, ...}, 5: {0: 1.34744877, 1: 1.3411452570000002, 2: 1.3819207039999999, 3: 1.409566476, 4: 1.714912836, 5: 0.0, 6: 1.066055209, 7: 1.351604984, 8: 1.6259631669999999, 9: 1.3790544889999998, ...}, 6: {0: 1.342737019, 1: 1.3364335059999999, 2: 1.377208953, 3: 1.404854725, 4: 1.710201085, 5: 1.066055209, 6: 0.0, 7: 1.346893233, 8: 1.6212514159999998, 9: 1.374342738, ...}, 7: {0: 0.370011188, 1: 0.363707675, 2: 0.40448312200000003, 3: 0.432128894, 4: 1.179148544, 5: 1.3516049840000002, 6: 1.346893233, 7: 0.0, 8: 1.3037685309999998, 9: 0.266968891, ...}, 8: {0: 1.2996123170000002, 1: 1.2933088040000003, 2: 1.3340842510000002, 3: 1.3617300230000002, 4: 1.6670763830000002, 5: 1.6259631670000003, 6: 1.6212514160000002, 7: 1.3037685310000002, 8: 0.0, 9: 1.331218036, ...}, 9: {0: 0.3974606930000001, 1: 0.39115718, 2: 0.431932627, 3: 0.4595783990000001, 4: 1.2065980490000001, 5: 1.3790544890000003, 6: 1.3743427380000002, 7: 0.26696889100000004, 8: 1.3312180359999999, 9: 0.0, ...}, ...}, family_id='GF_000007')
192 pairwise_estimates['Ka'].iloc[i, j],
193 pairwise_estimates['Omega'].iloc[i, j],
194 distance, grouping_node
195 ]
196
--> 197 if pairwise_estimates['Ks'].iloc[i, j] > 5:
pairwise_estimates.iloc = undefined
i = 20
j = 18
198 out.add(grouping_node)
199
200 df = pd.DataFrame.from_dict(weights, orient='index')
201 df.columns = ['Paralog1', 'Paralog2', 'Family',

TypeError: '>' not supported between instances of 'str' and 'int'
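
For context on the final error: the last frame compares a cell of the Ks table with `5`, and the message says that cell is a `str`. How a string ended up in the table is not established by this log. The sketch below only illustrates the comparison failing and how coercing cells to floats behaves; `to_float` is a hypothetical helper and this is not a patch for wgd 1.1.1.

```python
import math

def to_float(value):
    """Convert a table cell to float, returning NaN when it is not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return math.nan

# Comparing a string with an int fails in the same way as in the traceback.
try:
    "0.7065" > 5
except TypeError as err:
    print("TypeError:", err)

# After coercion the comparison is well defined; NaN compares as False.
cells = ["0.7065", "6.2", "nan", "-", 0.0]
print([to_float(c) > 5 for c in cells])
```

Inspecting the intermediate Ks values for family GF_000007 for non-numeric entries would be one way to narrow down where the string comes from.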


RE: dating tree

Hello,

You're providing the following dating tree:

17 1
((((Potamogeton_acutifolius,(Spirodela_intermedia,Amorphophallus_konjac)),(Acanthochlamys_bracteata,(Dioscorea_alata,Dioscorea_rotundata))'>0.5600<1.2863')'>0.8360<1.2863',(Acorus_americanus,Acorus_tatarinowii))'>0.8360<1.2863',((((Tetracentron_sinense,Trochodendron_aralioides),(Buxus_austroyunnanensis,Buxus_sinica))'>1.1080<1.2863',(Nelumbo_nucifera,(Telopea_speciosissima,Protea_cynaroides)))'>1.1080<1.2863',(Aquilegia_coerulea_ap1,Aquilegia_coerulea_ap2))'>1.1080<1.2863')'>1.2720<2.4720';

I understand that I need to replace the "Aquilegia_coerulea" branches with my species, but what about the branch lengths? Do I need to generate a new species tree that includes all those species plus mine?

What if I used the same species that I used in the orthology analysis?

Thanks
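
A small aid when editing a tree like the one quoted above: as written, it carries a topology and quoted bounds such as '>0.5600<1.2863', but no branch lengths. The first line, `17 1`, appears to give the number of species and the number of trees, as in the PAML/MCMCTree tree format (an assumption worth checking against the PAML documentation). The sketch below checks that the declared species count still matches the leaves after renaming or replacing species and lists the bounds. `leaf_names`, `calibrations` and `header_matches` are hypothetical helpers, and the three-species tree is made up.

```python
import re

def leaf_names(newick):
    """Return leaf labels of a Newick string whose node labels are single-quoted."""
    # Leaves follow "(" or ","; quoted calibration labels follow ")" and are skipped.
    return re.findall(r"[(,]\s*([^(),:;\s']+)", newick)

def calibrations(newick):
    """Return (lower, upper) bounds from labels written as '>lower<upper'."""
    return [(float(lo), float(hi)) for lo, hi in re.findall(r"'>([\d.]+)<([\d.]+)'", newick)]

def header_matches(header, newick):
    """Check that the first number of the header equals the number of leaves."""
    declared = int(header.split()[0])
    return declared == len(leaf_names(newick))

# Hypothetical three-species content in the same layout as the tree above.
header = "3 1"
tree = "((sp_A,sp_B)'>0.5600<1.2863',sp_C)'>1.2720<2.4720';"
print(leaf_names(tree), calibrations(tree), header_matches(header, tree))
```

If species are added or removed rather than renamed, the header count needs to change along with the tree.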

ksd global error

When I used ksd for multi-species comparisons, the following error occurred.
Here are my command and the error:
wgd ksd wgd_globalmrbh/global_MRBH.tsv --extraparanomeks wgd_ksd/Corallodiscus_lanuginosus_014.cds.tsv.ks.tsv -sp speciestree.nw --reweight -o wgd_globalmrbh_ks --spair "./cds/Corallodiscus_lanuginosus_014.cds;./cds/Titanotrichum_oldhamii_042.cds" --spair "./cds/Corallodiscus_lanuginosus_014.cds;./cds/Gesneria_cuneifolia_074.cds" --spair "./cds/Corallodiscus_lanuginosus_014.cds;./cds/Rhynchoglossum_obliquum_123.cds" --spair "./cds/Corallodiscus_lanuginosus_014.cds;./cds/Calceolaria_pinifolia.cds" --spair "./cds/Corallodiscus_lanuginosus_014.cds;./cds/Corallodiscus_lanuginosus_014.cds" --plotkde
16:28:54 INFO This is wgd v2.0.22 cli.py:32
16:28:55 ERROR Please provide at least one sequence file
How can I solve it?

peak

Hi, it's me again 🥹

wgd syn completed thanks to you. Now I am trying wgd peak with this command:

wgd peak 20628_8.cds.fasta.tsv.ks.tsv -ap wgd_syn/iadhore-out/anchorpoints.txt -sm wgd_syn/iadhore-out/segments.txt -le wgd_syn/iadhore-out/list_elements.txt -mp wgd_syn/iadhore-out/multiplicon_pairs.txt

I got this error:

Screenshot 2024-06-03 at 16 14 54

I searched for this error but couldn't find anything. Sorry for writing all the time...

Thanks.

Questions about input file sequence

Hi, @heche-psb

Can I use protein sequences directly for the WGD analysis? The full-length transcriptome I constructed currently has no CDS sequences. Is there any solution?

Error running dmd

Hello, thank you very much for the program!
However, I ran into an error at the first step, running wgd dmd.

Traceback (most recent call last):
  File "/home/hiuyan/.pyenv/versions/3.6.12/bin/wgd", line 11, in <module>
    load_entry_point('wgd==2.0.24', 'console_scripts', 'wgd')()
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/cli.py", line 113, in dmd
    _dmd(**kwargs)
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/cli.py", line 140, in _dmd
    s[0].get_paranome(inflation=inflation, eval=eval)
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/wgd/core.py", line 410, in get_paranome
    mcl_out = gf.run_mcl(inflation=inflation)
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/wgd/core.py", line 495, in run_mcl
    out = sp.run(command, stdout=sp.PIPE, stderr=sp.PIPE)
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/subprocess.py", line 423, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/subprocess.py", line 1364, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'mcxload': 'mcxload'

I tried with two different cds files, and both returned the same error message. Could you please help me fix this error? I installed wgd v2 via pip.

Thank you very much for your help!

(P.S. Sorry that I posted this question on the wgd v1 GitHub page, and sorry for the inconvenience.)

Regards,
Alex
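
The traceback above ends in `FileNotFoundError` for `mcxload`, which means Python's `subprocess` could not find an executable of that name on `PATH`. `mcxload` is a command-line program from the MCL package, not a Python module, so a pip install of wgd may not provide it (an assumption about this particular setup). A minimal sketch for checking which external programs are visible to Python follows; the helper name `find_executables` is made up for illustration, and the list of programs beyond `mcxload` is a guess rather than wgd's documented requirements.

```python
import shutil
from typing import Dict, Iterable, Optional


def find_executables(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each program name to its resolved path on PATH, or None if it is missing."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # 'mcxload' comes from the traceback; 'mcl' is assumed to be installed alongside it.
    for name, path in find_executables(["mcxload", "mcl"]).items():
        print(f"{name}: {path if path else 'NOT FOUND on PATH'}")
```

If a program is reported as missing, installing it with the system or conda package manager and re-running the check in the same environment that runs `wgd dmd` should show whether the error is resolved.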

Plot that combines paired comparisons and focal species plot

Hello! Thanks for this awesome tool. I was able to create all the plots, but wanted some help creating a big plot which essentially combines the two plots below:

All_pairs.ks.node.weighted.pdf
Mixed.ks.ivera_cds.fa.node.weighted.pdf

I saw a similar plot on the WGD Git page, but wanted some more guidance on how to go about making something like the plot below:
image

Here is my code for more context:
wgd dmd gmax_cds.fa ivera_cds.fa lupinalbus_cds.fa senna_cds.fa -f ivera_cds.fa -o wgd_dmd_all
wgd ksd wgd_dmd_all/merge_focus.tsv gmax_cds.fa ivera_cds.fa lupinalbus_cds.fa senna_cds.fa -o wgd_ksd_all
wgd viz -d wgd_ksd_all/merge_focus.tsv.ks.tsv --focus2all ivera_cds.fa --extraparanomeks inga_ksd/ivera_cds.fa.tsv.ks.tsv -sp speciestree.txt --anchorpoints wgd_syn/iadhore-out/anchorpoints.txt --plotapgmm --plotelmm -o wgd_viz_all

How to limit memory used by wgd peak?

Hello,

I am able to run the wgd pipeline on my local cluster up to the point of wgd viz, but at wgd peak the job always terminates due to high memory consumption. I have attached the .sh file I use to submit this job, as well as the error message from LSF (job_error_msg.txt) and the stderr and stdout of the job (Drosera_aliciae_wgd - Copy.%j).
files.zip

Is there a way to limit the memory consumption or otherwise troubleshoot this problem?

ks.svg in ksd output has some mistakes

Hello, thank you for this really useful tool. I'm having some problems with the ksd output.
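
The warnings quoted below (sequence length not a multiple of 3, in-frame stop codons, incomplete trailing codons) usually point at problems in the input CDS file itself. A minimal sketch for screening a CDS FASTA file before running `wgd ksd` is given here. It is independent of wgd's own checks, the function names `read_fasta` and `check_cds` are made up for illustration, and the stop codon set assumes the standard genetic code.

```python
from typing import Dict, Iterable, List

STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {record_id: sequence}; the id is the first word of the header."""
    records: Dict[str, str] = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            records[current] = ""
        elif current is not None:
            records[current] += line.upper()
    return records


def check_cds(seq: str) -> List[str]:
    """Return a list of problems found in one coding sequence (empty if none)."""
    problems = []
    if len(seq) % 3 != 0:
        problems.append("length %d is not a multiple of 3" % len(seq))
    usable = len(seq) - len(seq) % 3
    # A stop codon is only reported if it occurs before the last complete codon.
    for start in range(0, usable - 3, 3):
        codon = seq[start:start + 3]
        if codon in STOP_CODONS:
            problems.append("in-frame stop codon %s at %d:%d" % (codon, start, start + 3))
    return problems
```

To use it, open the CDS file, pass the file object to `read_fasta`, and call `check_cds` on each sequence; records with a non-empty result are candidates for removal or re-annotation before the Ks analysis.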
Part of the nohup output follows:

In-frame STOP codon in SC|017563.1.33 at position 33:36
Sequence length != multiple of 3 for SC|017563.1.393!
Invalid codon AG in SC|017563.1.393
Sequence length != multiple of 3 for SC|017563.1.531!
Invalid codon AG in SC|017563.1.531
Sequence length != multiple of 3 for SC|017563.1.578!
In-frame STOP codon in SC|017563.1.578 at position 3:6
Sequence length != multiple of 3 for SC|017563.1.660!
Invalid codon A in SC|017563.1.660
Sequence length != multiple of 3 for SC|017563.1.819!
Invalid codon G in SC|017563.1.819
Sequence length != multiple of 3 for SC|017563.1.845!
Invalid codon AG in SC|017563.1.845
Sequence length != multiple of 3 for SC|017563.1.918!
In-frame STOP codon in SC|017563.1.918 at position 21:24
Sequence length != multiple of 3 for SC|017563.1.1125!
Invalid codon G in SC|017563.1.1125
Sequence length != multiple of 3 for SC|017563.1.1131!
In-frame STOP codon in SC|017563.1.1131 at position 231:234
Sequence length != multiple of 3 for SC|017563.1.1191!
Invalid codon G in SC|017563.1.1191
Sequence length != multiple of 3 for SC|017563.1.1234!
Invalid codon G in SC|017563.1.1234
Sequence length != multiple of 3 for SC|017563.1.1389!
In-frame STOP codon in SC|017563.1.1389 at position 6:9
Sequence length != multiple of 3 for SC|017563.1.1510!
Invalid codon AC in SC|017563.1.1510
Sequence length != multiple of 3 for SC|017563.1.1543!
In-frame STOP codon in SC|017563.1.1543 at position 60:63
Sequence length != multiple of 3 for SC|017563.1.1614!
In-frame STOP codon in SC|017563.1.1614 at position 12:15
Sequence length != multiple of 3 for SC|017563.1.1622!
Invalid codon G in SC|017563.1.1622
Sequence length != multiple of 3 for SC|017563.1.1641!
In-frame STOP codon in SC|017563.1.1641 at position 60:63
Sequence length != multiple of 3 for SC|017563.1.1643!
In-frame STOP codon in SC|017563.1.1643 at position 9:12
Sequence length != multiple of 3 for SC|017563.1.1645!
Invalid codon A in SC|017563.1.1645
Sequence length != multiple of 3 for SC|017563.1.2017!
Invalid codon G in SC|017563.1.2017
Sequence length != multiple of 3 for SC|017562.1.28.1!
Invalid codon G in SC|017562.1.28.1
Sequence length != multiple of 3 for SC|017562.1.364!
In-frame STOP codon in SC|017562.1.364 at position 0:3
In-frame STOP codon in SC|017562.1.365 at position 3:6
Sequence length != multiple of 3 for SC|017562.1.398!
Invalid codon GA in SC|017562.1.398
Sequence length != multiple of 3 for SC|017562.1.431!
Invalid codon GG in SC|017562.1.431
Sequence length != multiple of 3 for SC|017562.1.438!
Invalid codon G in SC|017562.1.438
Sequence length != multiple of 3 for SC|017562.1.503!
Invalid codon G in SC|017562.1.503
Sequence length != multiple of 3 for SC|017562.1.695!
Invalid codon AA in SC|017562.1.695
Sequence length != multiple of 3 for SC|017562.1.723!
In-frame STOP codon in SC|017562.1.723 at position 15:18
Sequence length != multiple of 3 for SC|017562.1.763!
In-frame STOP codon in SC|017562.1.763 at position 39:42
Sequence length != multiple of 3 for SC|017562.1.837!
Invalid codon A in SC|017562.1.837
Sequence length != multiple of 3 for SC|017562.1.1015!
Invalid codon A in SC|017562.1.1015
Sequence length != multiple of 3 for SC|017562.1.1019!
In-frame STOP codon in SC|017562.1.1019 at position 0:3
Sequence length != multiple of 3 for SC|017562.1.1055!
Invalid codon CT in SC|017562.1.1055
Sequence length != multiple of 3 for SC|017562.1.1128!
Invalid codon G in SC|017562.1.1128
Sequence length != multiple of 3 for SC|017562.1.1173!
In-frame STOP codon in SC|017562.1.1173 at position 18:21
Sequence length != multiple of 3 for SC|017562.1.1383!
Invalid codon G in SC|017562.1.1383
Sequence length != multiple of 3 for SC|017562.1.1392!
Invalid codon AA in SC|017562.1.1392
Sequence length != multiple of 3 for SC|017562.1.1417!
In-frame STOP codon in SC|017562.1.1417 at position 123:126
Sequence length != multiple of 3 for SC|017562.1.1518!
Invalid codon G in SC|017562.1.1518
Sequence length != multiple of 3 for SC|017562.1.1629!
Invalid codon G in SC|017562.1.1629
Sequence length != multiple of 3 for SC|017562.1.1794!
Invalid codon A in SC|017562.1.1794
Sequence length != multiple of 3 for SC|017562.1.1925!
Invalid codon G in SC|017562.1.1925
Sequence length != multiple of 3 for SC|017562.1.2092.1!
Invalid codon AG in SC|017562.1.2092.1
Sequence length != multiple of 3 for SC|017562.1.2356!
In-frame STOP codon in SC|017562.1.2356 at position 603:606
Sequence length != multiple of 3 for SC|017562.1.2503!
Invalid codon GA in SC|017562.1.2503
Sequence length != multiple of 3 for SC|017558.1.4!
Invalid codon G in SC|017558.1.4
Sequence length != multiple of 3 for SC|017558.1.10!
Invalid codon G in SC|017558.1.10
Sequence length != multiple of 3 for SC|017558.1.125!
Invalid codon AC in SC|017558.1.125
Sequence length != multiple of 3 for SC|017558.1.126!
In-frame STOP codon in SC|017558.1.126 at position 1314:1317
Sequence length != multiple of 3 for SC|017558.1.130!
Invalid codon G in SC|017558.1.130
Sequence length != multiple of 3 for SC|017558.1.202!
In-frame STOP codon in SC|017558.1.202 at position 3:6
Sequence length != multiple of 3 for SC|017558.1.286!
Invalid codon G in SC|017558.1.286
Sequence length != multiple of 3 for SC|017558.1.309!
Invalid codon G in SC|017558.1.309
Sequence length != multiple of 3 for SC|017558.1.384!
Invalid codon A in SC|017558.1.384
Sequence length != multiple of 3 for SC|017558.1.484!
Invalid codon G in SC|017558.1.484
Sequence length != multiple of 3 for SC|017558.1.523!
Invalid codon A in SC|017558.1.523
Sequence length != multiple of 3 for SC|017558.1.613!
Invalid codon G in SC|017558.1.613
Sequence length != multiple of 3 for SC|017558.1.729!
Invalid codon A in SC|017558.1.729
Sequence length != multiple of 3 for SC|017558.1.738!
Invalid codon AA in SC|017558.1.738
Sequence length != multiple of 3 for SC|017558.1.739!
Invalid codon G in SC|017558.1.739
Sequence length != multiple of 3 for SC|017558.1.858!
In-frame STOP codon in SC|017558.1.858 at position 30:33
Sequence length != multiple of 3 for SC|017558.1.954!
Invalid codon AA in SC|017558.1.954
Sequence length != multiple of 3 for SC|017558.1.1147!
In-frame STOP codon in SC|017558.1.1147 at position 24:27
Sequence length != multiple of 3 for SC|017558.1.1148!
Invalid codon G in SC|017558.1.1148
Sequence length != multiple of 3 for SC|017558.1.1248!
Invalid codon A in SC|017558.1.1248
Sequence length != multiple of 3 for SC|017558.1.1342!
In-frame STOP codon in SC|017558.1.1342 at position 3:6
Sequence length != multiple of 3 for SC|017558.1.1373!
Invalid codon G in SC|017558.1.1373
Sequence length != multiple of 3 for SC|017558.1.1464!
Invalid codon GA in SC|017558.1.1464
Sequence length != multiple of 3 for SC|017558.1.1465!
Invalid codon AA in SC|017558.1.1465
Sequence length != multiple of 3 for SC|017558.1.1476!
Invalid codon A in SC|017558.1.1476
Sequence length != multiple of 3 for SC|017558.1.1507!
In-frame STOP codon in SC|017558.1.1507 at position 384:387
Sequence length != multiple of 3 for SC|017558.1.1528!
Invalid codon AA in SC|017558.1.1528
Sequence length != multiple of 3 for SC|017558.1.1555!
In-frame STOP codon in SC|017558.1.1555 at position 33:36
Sequence length != multiple of 3 for SC|017558.1.1560!
Invalid codon CG in SC|017558.1.1560
Sequence length != multiple of 3 for SC|017558.1.1591!
Invalid codon TG in SC|017558.1.1591
Sequence length != multiple of 3 for SC|017558.1.1686!
Invalid codon AA in SC|017558.1.1686
Sequence length != multiple of 3 for SC|017558.1.1810!
Invalid codon AG in SC|017558.1.1810
Sequence length != multiple of 3 for SC|017558.1.1822!
Invalid codon A in SC|017558.1.1822
Sequence length != multiple of 3 for SC|017558.1.2121!
In-frame STOP codon in SC|017558.1.2121 at position 27:30
Sequence length != multiple of 3 for SC|017558.1.2144!
Invalid codon AC in SC|017558.1.2144
Sequence length != multiple of 3 for SC|017558.1.2186!
Invalid codon G in SC|017558.1.2186
Sequence length != multiple of 3 for SC|017558.1.2234!
Invalid codon CA in SC|017558.1.2234
Sequence length != multiple of 3 for SC|017558.1.2242!
Invalid codon AG in SC|017558.1.2242
Sequence length != multiple of 3 for SC|017558.1.2297!
Invalid codon G in SC|017558.1.2297
Sequence length != multiple of 3 for SC|017558.1.2415!
Invalid codon G in SC|017558.1.2415
Sequence length != multiple of 3 for SC|017558.1.2459!
In-frame STOP codon in SC|017558.1.2459 at position 93:96
Sequence length != multiple of 3 for SC|017558.1.2697!
Invalid codon TA in SC|017558.1.2697
Sequence length != multiple of 3 for SC|017558.1.2711!
In-frame STOP codon in SC|017558.1.2711 at position 150:153
Sequence length != multiple of 3 for SC|017558.1.2786!
Invalid codon AT in SC|017558.1.2786
Sequence length != multiple of 3 for SC|017558.1.2859!
In-frame STOP codon in SC|017558.1.2859 at position 39:42
In-frame STOP codon in SC|017571.1.40 at position 153:156
Sequence length != multiple of 3 for SC|017571.1.133!
In-frame STOP codon in SC|017571.1.133 at position 15:18
Sequence length != multiple of 3 for SC|017571.1.189!
Invalid codon GT in SC|017571.1.189
Sequence length != multiple of 3 for SC|017571.1.441!
Invalid codon T in SC|017571.1.441
Sequence length != multiple of 3 for SC|017571.1.464.1!
Invalid codon G in SC|017571.1.464.1
Sequence length != multiple of 3 for SC|017571.1.642!
Invalid codon A in SC|017571.1.642
Sequence length != multiple of 3 for SC|017571.1.645!
Invalid codon AG in SC|017571.1.645
Sequence length != multiple of 3 for SC|017571.1.646!
Invalid codon GG in SC|017571.1.646
Sequence length != multiple of 3 for SC|017571.1.693!
Invalid codon G in SC|017571.1.693
Sequence length != multiple of 3 for SC|017571.1.718!
Invalid codon G in SC|017571.1.718
Sequence length != multiple of 3 for SC|017571.1.782!
Invalid codon TT in SC|017571.1.782
Sequence length != multiple of 3 for SC|017571.1.1016!
Invalid codon G in SC|017571.1.1016
Sequence length != multiple of 3 for SC|017571.1.1056!
In-frame STOP codon in SC|017571.1.1056 at position 261:264
Sequence length != multiple of 3 for SC|017571.1.1124!
Invalid codon G in SC|017571.1.1124
Sequence length != multiple of 3 for SC|017571.1.1210!
Invalid codon AG in SC|017571.1.1210
Sequence length != multiple of 3 for SC|017571.1.1257!
In-frame STOP codon in SC|017571.1.1257 at position 156:159
Sequence length != multiple of 3 for SC|017571.1.1259!
In-frame STOP codon in SC|017571.1.1259 at position 285:288
Sequence length != multiple of 3 for SC|017571.1.1260!
Invalid codon G in SC|017571.1.1260
Sequence length != multiple of 3 for SC|017571.1.1298!
Invalid codon A in SC|017571.1.1298
In-frame STOP codon in SC|017571.1.1319 at position 138:141
Sequence length != multiple of 3 for SC|017571.1.1396!
In-frame STOP codon in SC|017571.1.1396 at position 612:615
Sequence length != multiple of 3 for SC|017571.1.1397!
In-frame STOP codon in SC|017571.1.1397 at position 69:72
Sequence length != multiple of 3 for SC|017571.1.1414!
In-frame STOP codon in SC|017571.1.1414 at position 54:57
Sequence length != multiple of 3 for SC|017571.1.1562!
Invalid codon A in SC|017571.1.1562
Sequence length != multiple of 3 for SC|017571.1.1983!
Invalid codon AG in SC|017571.1.1983
Sequence length != multiple of 3 for SC|017571.1.1989!
Invalid codon AG in SC|017571.1.1989
Sequence length != multiple of 3 for SC|017571.1.1991!
In-frame STOP codon in SC|017571.1.1991 at position 3:6
Sequence length != multiple of 3 for SC|017571.1.2014!
In-frame STOP codon in SC|017571.1.2014 at position 378:381
Sequence length != multiple of 3 for SC|017571.1.2016!
In-frame STOP codon in SC|017571.1.2016 at position 33:36
Sequence length != multiple of 3 for SC|017571.1.2017!
Invalid codon CT in SC|017571.1.2017
Sequence length != multiple of 3 for SC|017556.1.114!
Invalid codon A in SC|017556.1.114
Sequence length != multiple of 3 for SC|017556.1.134!
In-frame STOP codon in SC|017556.1.134 at position 309:312
Sequence length != multiple of 3 for SC|017556.1.197!
Invalid codon A in SC|017556.1.197
Sequence length != multiple of 3 for SC|017556.1.198!
In-frame STOP codon in SC|017556.1.198 at position 30:33
Sequence length != multiple of 3 for SC|017556.1.211.8!
Invalid codon A in SC|017556.1.211.8
Sequence length != multiple of 3 for SC|017556.1.216!
In-frame STOP codon in SC|017556.1.216 at position 99:102
Sequence length != multiple of 3 for SC|017556.1.230!
In-frame STOP codon in SC|017556.1.230 at position 36:39
Sequence length != multiple of 3 for SC|017556.1.272!
Invalid codon G in SC|017556.1.272
Sequence length != multiple of 3 for SC|017556.1.299!
Invalid codon AG in SC|017556.1.299
Sequence length != multiple of 3 for SC|017556.1.391!
In-frame STOP codon in SC|017556.1.391 at position 534:537
Sequence length != multiple of 3 for SC|017556.1.528!
In-frame STOP codon in SC|017556.1.528 at position 69:72
Sequence length != multiple of 3 for SC|017556.1.567!
In-frame STOP codon in SC|017556.1.567 at position 39:42
Sequence length != multiple of 3 for SC|017556.1.651!
Invalid codon AG in SC|017556.1.651
Sequence length != multiple of 3 for SC|017556.1.652!
Invalid codon G in SC|017556.1.652
Sequence length != multiple of 3 for SC|017556.1.679!
In-frame STOP codon in SC|017556.1.679 at position 15:18
Sequence length != multiple of 3 for SC|017556.1.704!
Invalid codon CT in SC|017556.1.704
Sequence length != multiple of 3 for SC|017556.1.709!
Invalid codon AA in SC|017556.1.709
Sequence length != multiple of 3 for SC|017556.1.802!
Invalid codon G in SC|017556.1.802
Sequence length != multiple of 3 for SC|017556.1.842!
Invalid codon A in SC|017556.1.842
Sequence length != multiple of 3 for SC|017556.1.943!
In-frame STOP codon in SC|017556.1.943 at position 336:339
Sequence length != multiple of 3 for SC|017556.1.1033!
In-frame STOP codon in SC|017556.1.1033 at position 36:39
Sequence length != multiple of 3 for SC|017556.1.1042!
In-frame STOP codon in SC|017556.1.1042 at position 99:102
Sequence length != multiple of 3 for SC|017556.1.1622!
Invalid codon A in SC|017556.1.1622
Sequence length != multiple of 3 for SC|017556.1.1623!
In-frame STOP codon in SC|017556.1.1623 at position 75:78
Sequence length != multiple of 3 for SC|017556.1.1624!
Invalid codon G in SC|017556.1.1624
Sequence length != multiple of 3 for SC|017556.1.1790!
Invalid codon A in SC|017556.1.1790
Sequence length != multiple of 3 for SC|017556.1.1944!
Invalid codon A in SC|017556.1.1944
Sequence length != multiple of 3 for SC|017556.1.2016!
Invalid codon AG in SC|017556.1.2016
Sequence length != multiple of 3 for SC|017556.1.2034!
Invalid codon AG in SC|017556.1.2034
Sequence length != multiple of 3 for SC|017556.1.2131!
Invalid codon C in SC|017556.1.2131
Sequence length != multiple of 3 for SC|017556.1.2265!
Invalid codon CT in SC|017556.1.2265
Sequence length != multiple of 3 for SC|017556.1.2417!
Invalid codon GA in SC|017556.1.2417
Sequence length != multiple of 3 for SC|017556.1.2493!
In-frame STOP codon in SC|017556.1.2493 at position 24:27
Sequence length != multiple of 3 for SC|017556.1.2591!
Invalid codon CT in SC|017556.1.2591
Sequence length != multiple of 3 for SC|017556.1.2762.1!
Invalid codon AG in SC|017556.1.2762.1
Sequence length != multiple of 3 for SC|017556.1.2913!
Invalid codon AG in SC|017556.1.2913
Sequence length != multiple of 3 for SC|017556.1.2973!
Invalid codon G in SC|017556.1.2973
Sequence length != multiple of 3 for SC|017561.1.31!
Invalid codon AA in SC|017561.1.31
Sequence length != multiple of 3 for SC|017561.1.151!
In-frame STOP codon in SC|017561.1.151 at position 3:6
Sequence length != multiple of 3 for SC|017561.1.179!
In-frame STOP codon in SC|017561.1.179 at position 27:30
Sequence length != multiple of 3 for SC|017561.1.213!
Invalid codon T in SC|017561.1.213
Sequence length != multiple of 3 for SC|017561.1.351!
Invalid codon AG in SC|017561.1.351
Sequence length != multiple of 3 for SC|017561.1.611!
Invalid codon GG in SC|017561.1.611
Sequence length != multiple of 3 for SC|017561.1.844!
Invalid codon AG in SC|017561.1.844
Sequence length != multiple of 3 for SC|017561.1.960!
In-frame STOP codon in SC|017561.1.960 at position 36:39
Sequence length != multiple of 3 for SC|017561.1.1003!
In-frame STOP codon in SC|017561.1.1003 at position 9:12
Sequence length != multiple of 3 for SC|017561.1.1516!
In-frame STOP codon in SC|017561.1.1516 at position 51:54
Sequence length != multiple of 3 for SC|017561.1.1593!
Invalid codon G in SC|017561.1.1593
Sequence length != multiple of 3 for SC|017561.1.1595!
In-frame STOP codon in SC|017561.1.1595 at position 144:147
Sequence length != multiple of 3 for SC|017561.1.1715!
In-frame STOP codon in SC|017561.1.1715 at position 3:6
Sequence length != multiple of 3 for SC|017561.1.1776!
In-frame STOP codon in SC|017561.1.1776 at position 18:21
Sequence length != multiple of 3 for SC|017561.1.1798!
Invalid codon G in SC|017561.1.1798
Sequence length != multiple of 3 for SC|017561.1.1874!
Invalid codon AG in SC|017561.1.1874
Sequence length != multiple of 3 for SC|017561.1.1883!
Invalid codon A in SC|017561.1.1883
Sequence length != multiple of 3 for SC|017561.1.2154!
Invalid codon G in SC|017561.1.2154
Sequence length != multiple of 3 for SC|017561.1.2203!
Invalid codon G in SC|017561.1.2203
Sequence length != multiple of 3 for SC|017561.1.2347!
Invalid codon AG in SC|017561.1.2347
Sequence length != multiple of 3 for SC|017561.1.2368!
In-frame STOP codon in SC|017561.1.2368 at position 75:78
Sequence length != multiple of 3 for SC|017561.1.2387!
Invalid codon G in SC|017561.1.2387
Sequence length != multiple of 3 for SC|017561.1.2391!
In-frame STOP codon in SC|017561.1.2391 at position 6:9
Sequence length != multiple of 3 for SC|017561.1.2392!
Invalid codon C in SC|017561.1.2392
Sequence length != multiple of 3 for SC|017567.1.16!
Invalid codon G in SC|017567.1.16
Sequence length != multiple of 3 for SC|017567.1.19!
Invalid codon A in SC|017567.1.19
Sequence length != multiple of 3 for SC|017567.1.60!
Invalid codon A in SC|017567.1.60
Sequence length != multiple of 3 for SC|017567.1.73!
Invalid codon TC in SC|017567.1.73
Sequence length != multiple of 3 for SC|017567.1.107!
Invalid codon G in SC|017567.1.107
Sequence length != multiple of 3 for SC|017567.1.113!
Invalid codon G in SC|017567.1.113
Sequence length != multiple of 3 for SC|017567.1.180!
Invalid codon G in SC|017567.1.180
Sequence length != multiple of 3 for SC|017567.1.181!
Invalid codon G in SC|017567.1.181
Sequence length != multiple of 3 for SC|017567.1.207!
In-frame STOP codon in SC|017567.1.207 at position 9:12
Sequence length != multiple of 3 for SC|017567.1.439!
Invalid codon G in SC|017567.1.439
Sequence length != multiple of 3 for SC|017567.1.622!
Invalid codon T in SC|017567.1.622
Sequence length != multiple of 3 for SC|017567.1.860!
In-frame STOP codon in SC|017567.1.860 at position 24:27
Sequence length != multiple of 3 for SC|017567.1.861!
In-frame STOP codon in SC|017567.1.861 at position 24:27
Sequence length != multiple of 3 for SC|017567.1.864!
In-frame STOP codon in SC|017567.1.864 at position 0:3
Sequence length != multiple of 3 for SC|017567.1.865!
In-frame STOP codon in SC|017567.1.865 at position 39:42
Sequence length != multiple of 3 for SC|017567.1.891!
Invalid codon C in SC|017567.1.891
Sequence length != multiple of 3 for SC|017567.1.897!
In-frame STOP codon in SC|017567.1.897 at position 18:21
Sequence length != multiple of 3 for SC|017567.1.1059!
Invalid codon GT in SC|017567.1.1059
Sequence length != multiple of 3 for SC|017567.1.1212!
Invalid codon CG in SC|017567.1.1212
Sequence length != multiple of 3 for SC|017567.1.1295!
In-frame STOP codon in SC|017567.1.1295 at position 3:6
Sequence length != multiple of 3 for SC|017567.1.1308!
Invalid codon TG in SC|017567.1.1308
Sequence length != multiple of 3 for SC|017567.1.1592!
In-frame STOP codon in SC|017567.1.1592 at position 6:9
Sequence length != multiple of 3 for SC|017567.1.1861!
Invalid codon CA in SC|017567.1.1861
Sequence length != multiple of 3 for SC|017568.1.20!
Invalid codon A in SC|017568.1.20
Sequence length != multiple of 3 for SC|017568.1.89!
Invalid codon AG in SC|017568.1.89
Sequence length != multiple of 3 for SC|017568.1.300!
In-frame STOP codon in SC|017568.1.300 at position 39:42
Sequence length != multiple of 3 for SC|017568.1.415!
In-frame STOP codon in SC|017568.1.415 at position 6:9
Sequence length != multiple of 3 for SC|017568.1.453!
In-frame STOP codon in SC|017568.1.453 at position 3:6
Sequence length != multiple of 3 for SC|017568.1.645!
In-frame STOP codon in SC|017568.1.645 at position 15:18
Sequence length != multiple of 3 for SC|017568.1.730!
Invalid codon AA in SC|017568.1.730
Sequence length != multiple of 3 for SC|017568.1.944.1!
Invalid codon A in SC|017568.1.944.1
Sequence length != multiple of 3 for SC|017568.1.1271!
Invalid codon G in SC|017568.1.1271
Sequence length != multiple of 3 for SC|017568.1.1290!
Invalid codon G in SC|017568.1.1290
Sequence length != multiple of 3 for SC|017568.1.1291!
In-frame STOP codon in SC|017568.1.1291 at position 3:6
Sequence length != multiple of 3 for SC|017568.1.1318!
In-frame STOP codon in SC|017568.1.1318 at position 9:12
Sequence length != multiple of 3 for SC|017568.1.1795!
In-frame STOP codon in SC|017568.1.1795 at position 48:51
Sequence length != multiple of 3 for SC|017568.1.1857!
In-frame STOP codon in SC|017568.1.1857 at position 231:234
Sequence length != multiple of 3 for SC|017568.1.1967!
In-frame STOP codon in SC|017568.1.1967 at position 39:42
Sequence length != multiple of 3 for SC|017572.1.255!
Invalid codon A in SC|017572.1.255
Sequence length != multiple of 3 for SC|017572.1.277!
Invalid codon G in SC|017572.1.277
Sequence length != multiple of 3 for SC|017572.1.336!
In-frame STOP codon in SC|017572.1.336 at position 600:603
Sequence length != multiple of 3 for SC|017572.1.385!
In-frame STOP codon in SC|017572.1.385 at position 60:63
Sequence length != multiple of 3 for SC|017572.1.432!
Invalid codon G in SC|017572.1.432
Sequence length != multiple of 3 for SC|017572.1.532!
Invalid codon G in SC|017572.1.532
Sequence length != multiple of 3 for SC|017572.1.595!
Invalid codon A in SC|017572.1.595
Sequence length != multiple of 3 for SC|017572.1.691!
Invalid codon A in SC|017572.1.691
Sequence length != multiple of 3 for SC|017572.1.957!
Invalid codon G in SC|017572.1.957
Sequence length != multiple of 3 for SC|017572.1.1082!
In-frame STOP codon in SC|017572.1.1082 at position 48:51
Sequence length != multiple of 3 for SC|017572.1.1100!
Invalid codon G in SC|017572.1.1100
Sequence length != multiple of 3 for SC|017572.1.1141!
Invalid codon TG in SC|017572.1.1141
Sequence length != multiple of 3 for SC|017572.1.1168!
Invalid codon G in SC|017572.1.1168
Sequence length != multiple of 3 for SC|017572.1.1228!
Invalid codon G in SC|017572.1.1228
Sequence length != multiple of 3 for SC|017572.1.1257!
Invalid codon G in SC|017572.1.1257
Sequence length != multiple of 3 for SC|017572.1.1347!
In-frame STOP codon in SC|017572.1.1347 at position 111:114
Sequence length != multiple of 3 for SC|017572.1.1352!
In-frame STOP codon in SC|017572.1.1352 at position 9:12
Sequence length != multiple of 3 for SC|017572.1.1365!
Invalid codon AG in SC|017572.1.1365
Sequence length != multiple of 3 for SC|017572.1.1388!
Invalid codon A in SC|017572.1.1388
Sequence length != multiple of 3 for SC|017572.1.1644!
Invalid codon AG in SC|017572.1.1644
Sequence length != multiple of 3 for SC|017572.1.1645!
In-frame STOP codon in SC|017572.1.1645 at position 57:60
Sequence length != multiple of 3 for SC|017572.1.1649!
Invalid codon AT in SC|017572.1.1649
Sequence length != multiple of 3 for SC|017572.1.1714!
Invalid codon AA in SC|017572.1.1714
Sequence length != multiple of 3 for SC|017572.1.1724!
Invalid codon CT in SC|017572.1.1724
Sequence length != multiple of 3 for SC|017572.1.1726!
In-frame STOP codon in SC|017572.1.1726 at position 141:144
Sequence length != multiple of 3 for SC|017572.1.1742!
Invalid codon A in SC|017572.1.1742
Sequence length != multiple of 3 for SC|017572.1.1780!
Invalid codon G in SC|017572.1.1780
Sequence length != multiple of 3 for SC|017572.1.1785!
Invalid codon G in SC|017572.1.1785
Sequence length != multiple of 3 for SC|017564.1.20!
Invalid codon GA in SC|017564.1.20
Sequence length != multiple of 3 for SC|017564.1.30!
Invalid codon GA in SC|017564.1.30
Sequence length != multiple of 3 for SC|017564.1.68!
In-frame STOP codon in SC|017564.1.68 at position 93:96
Sequence length != multiple of 3 for SC|017564.1.146!
In-frame STOP codon in SC|017564.1.146 at position 54:57
Sequence length != multiple of 3 for SC|017564.1.306!
Invalid codon AA in SC|017564.1.306
Sequence length != multiple of 3 for SC|017564.1.332!
In-frame STOP codon in SC|017564.1.332 at position 42:45
Sequence length != multiple of 3 for SC|017564.1.361!
Invalid codon GA in SC|017564.1.361
Sequence length != multiple of 3 for SC|017564.1.462!
Invalid codon AG in SC|017564.1.462
Sequence length != multiple of 3 for SC|017564.1.579!
Invalid codon G in SC|017564.1.579
Sequence length != multiple of 3 for SC|017564.1.613!
In-frame STOP codon in SC|017564.1.613 at position 480:483
Sequence length != multiple of 3 for SC|017564.1.798!
Invalid codon AC in SC|017564.1.798
Sequence length != multiple of 3 for SC|017564.1.1131!
Invalid codon GG in SC|017564.1.1131
Sequence length != multiple of 3 for SC|017564.1.1227!
Invalid codon G in SC|017564.1.1227
Sequence length != multiple of 3 for SC|017564.1.1340!
Invalid codon A in SC|017564.1.1340
Sequence length != multiple of 3 for SC|017564.1.1424!
Invalid codon TC in SC|017564.1.1424
Sequence length != multiple of 3 for SC|017564.1.1486!
In-frame STOP codon in SC|017564.1.1486 at position 3:6
Sequence length != multiple of 3 for SC|017564.1.1493!
Invalid codon GT in SC|017564.1.1493
Sequence length != multiple of 3 for SC|017564.1.1612!
Invalid codon T in SC|017564.1.1612
Sequence length != multiple of 3 for SC|017564.1.1654!
Invalid codon A in SC|017564.1.1654
Sequence length != multiple of 3 for SC|017564.1.1683!
In-frame STOP codon in SC|017564.1.1683 at position 45:48
Sequence length != multiple of 3 for SC|017564.1.1755!
Invalid codon AG in SC|017564.1.1755
Sequence length != multiple of 3 for SC|017564.1.1759!
In-frame STOP codon in SC|017564.1.1759 at position 111:114
Sequence length != multiple of 3 for SC|017564.1.1812!
In-frame STOP codon in SC|017564.1.1812 at position 24:27
Sequence length != multiple of 3 for SC|017564.1.2005!
In-frame STOP codon in SC|017564.1.2005 at position 36:39
Sequence length != multiple of 3 for SC|017565.1.46!
Invalid codon AA in SC|017565.1.46
Sequence length != multiple of 3 for SC|017565.1.61!
In-frame STOP codon in SC|017565.1.61 at position 12:15
Sequence length != multiple of 3 for SC|017565.1.98!
In-frame STOP codon in SC|017565.1.98 at position 60:63
Sequence length != multiple of 3 for SC|017565.1.338!
Invalid codon GT in SC|017565.1.338
Sequence length != multiple of 3 for SC|017565.1.354!
Invalid codon A in SC|017565.1.354
Sequence length != multiple of 3 for SC|017565.1.377!
Invalid codon G in SC|017565.1.377
Sequence length != multiple of 3 for SC|017565.1.453!
Invalid codon T in SC|017565.1.453
Sequence length != multiple of 3 for SC|017565.1.523.4!
Invalid codon G in SC|017565.1.523.4
Sequence length != multiple of 3 for SC|017565.1.535!
Invalid codon G in SC|017565.1.535
Sequence length != multiple of 3 for SC|017565.1.593!
Invalid codon AT in SC|017565.1.593
Sequence length != multiple of 3 for SC|017565.1.685!
Invalid codon A in SC|017565.1.685
Sequence length != multiple of 3 for SC|017565.1.867!
Invalid codon AA in SC|017565.1.867
Sequence length != multiple of 3 for SC|017565.1.889!
Invalid codon G in SC|017565.1.889
Sequence length != multiple of 3 for SC|017565.1.1095!
In-frame STOP codon in SC|017565.1.1095 at position 183:186
Sequence length != multiple of 3 for SC|017565.1.1113!
In-frame STOP codon in SC|017565.1.1113 at position 78:81
Sequence length != multiple of 3 for SC|017565.1.1118!
Invalid codon AG in SC|017565.1.1118
Sequence length != multiple of 3 for SC|017565.1.1337!
Invalid codon GA in SC|017565.1.1337
Sequence length != multiple of 3 for SC|017565.1.1565!
Invalid codon G in SC|017565.1.1565
Sequence length != multiple of 3 for SC|017565.1.1659!
Invalid codon A in SC|017565.1.1659
Sequence length != multiple of 3 for SC|017565.1.1706!
Invalid codon G in SC|017565.1.1706
Sequence length != multiple of 3 for SC|017565.1.1884!
In-frame STOP codon in SC|017565.1.1884 at position 18:21
Sequence length != multiple of 3 for SC|017565.1.1989!
Invalid codon A in SC|017565.1.1989
Sequence length != multiple of 3 for SC|017565.1.2223!
In-frame STOP codon in SC|017565.1.2223 at position 15:18
Sequence length != multiple of 3 for SC|017560.1.109!
Invalid codon G in SC|017560.1.109
Sequence length != multiple of 3 for SC|017560.1.134!
In-frame STOP codon in SC|017560.1.134 at position 48:51
Sequence length != multiple of 3 for SC|017560.1.346!
Invalid codon G in SC|017560.1.346
Sequence length != multiple of 3 for SC|017560.1.364!
Invalid codon C in SC|017560.1.364
Sequence length != multiple of 3 for SC|017560.1.426!
Invalid codon AT in SC|017560.1.426
Sequence length != multiple of 3 for SC|017560.1.475!
Invalid codon G in SC|017560.1.475
In-frame STOP codon in SC|017560.1.499 at position 45:48
Sequence length != multiple of 3 for SC|017560.1.619!
Invalid codon AG in SC|017560.1.619
Sequence length != multiple of 3 for SC|017560.1.728!
Invalid codon G in SC|017560.1.728
Sequence length != multiple of 3 for SC|017560.1.866!
Invalid codon AT in SC|017560.1.866
Sequence length != multiple of 3 for SC|017560.1.886!
Invalid codon C in SC|017560.1.886
Sequence length != multiple of 3 for SC|017560.1.951!
Invalid codon G in SC|017560.1.951
Sequence length != multiple of 3 for SC|017560.1.1185!
Invalid codon G in SC|017560.1.1185
Sequence length != multiple of 3 for SC|017560.1.1242!
Invalid codon A in SC|017560.1.1242
Sequence length != multiple of 3 for SC|017560.1.1248.1!
Invalid codon A in SC|017560.1.1248.1
Sequence length != multiple of 3 for SC|017560.1.1413!
Invalid codon AA in SC|017560.1.1413
Sequence length != multiple of 3 for SC|017560.1.1549!
Invalid codon G in SC|017560.1.1549
Sequence length != multiple of 3 for SC|017560.1.1556!
In-frame STOP codon in SC|017560.1.1556 at position 210:213
Sequence length != multiple of 3 for SC|017560.1.1720!
Invalid codon G in SC|017560.1.1720
Sequence length != multiple of 3 for SC|017560.1.1828.2!
Invalid codon GA in SC|017560.1.1828.2
Sequence length != multiple of 3 for SC|017560.1.1888!
Invalid codon T in SC|017560.1.1888
Sequence length != multiple of 3 for SC|017560.1.1956!
Invalid codon T in SC|017560.1.1956
Sequence length != multiple of 3 for SC|017560.1.2010!
Invalid codon GA in SC|017560.1.2010
Sequence length != multiple of 3 for SC|017560.1.2088!
In-frame STOP codon in SC|017560.1.2088 at position 171:174
Sequence length != multiple of 3 for SC|017560.1.2131!
Invalid codon GA in SC|017560.1.2131
Sequence length != multiple of 3 for SC|017560.1.2182!
Invalid codon AG in SC|017560.1.2182
Sequence length != multiple of 3 for SC|017560.1.2256!
Invalid codon AG in SC|017560.1.2256
Sequence length != multiple of 3 for SC|017560.1.2308!
In-frame STOP codon in SC|017560.1.2308 at position 57:60
Sequence length != multiple of 3 for SC|017560.1.2324!
Invalid codon AC in SC|017560.1.2324
Sequence length != multiple of 3 for SC|017560.1.2359!
In-frame STOP codon in SC|017560.1.2359 at position 0:3
Sequence length != multiple of 3 for SC|017560.1.2431!
In-frame STOP codon in SC|017560.1.2431 at position 42:45
Sequence length != multiple of 3 for SC|017569.1.27!
Invalid codon AA in SC|017569.1.27
Sequence length != multiple of 3 for SC|017569.1.190!
In-frame STOP codon in SC|017569.1.190 at position 9:12
Sequence length != multiple of 3 for SC|017569.1.243!
Invalid codon GA in SC|017569.1.243
Sequence length != multiple of 3 for SC|017569.1.270!
Invalid codon GG in SC|017569.1.270
Sequence length != multiple of 3 for SC|017569.1.352!
Invalid codon GG in SC|017569.1.352
Sequence length != multiple of 3 for SC|017569.1.510!
In-frame STOP codon in SC|017569.1.510 at position 186:189
Sequence length != multiple of 3 for SC|017569.1.537!
In-frame STOP codon in SC|017569.1.537 at position 33:36
Sequence length != multiple of 3 for SC|017569.1.818!
Invalid codon AG in SC|017569.1.818
Sequence length != multiple of 3 for SC|017569.1.823!
In-frame STOP codon in SC|017569.1.823 at position 33:36
Sequence length != multiple of 3 for SC|017569.1.832!
In-frame STOP codon in SC|017569.1.832 at position 30:33
Sequence length != multiple of 3 for SC|017569.1.1211!
In-frame STOP codon in SC|017569.1.1211 at position 36:39
Sequence length != multiple of 3 for SC|017569.1.1297!
Invalid codon AA in SC|017569.1.1297
Sequence length != multiple of 3 for SC|017569.1.1411!
In-frame STOP codon in SC|017569.1.1411 at position 294:297
Sequence length != multiple of 3 for SC|017569.1.1444!
Invalid codon AA in SC|017569.1.1444
Sequence length != multiple of 3 for SC|017569.1.1474!
In-frame STOP codon in SC|017569.1.1474 at position 84:87
Sequence length != multiple of 3 for SC|017569.1.1493!
Invalid codon AC in SC|017569.1.1493
Sequence length != multiple of 3 for SC|017569.1.1579!
Invalid codon CA in SC|017569.1.1579
Sequence length != multiple of 3 for SC|017569.1.1604!
Invalid codon G in SC|017569.1.1604
Sequence length != multiple of 3 for SC|017569.1.1729!
Invalid codon TT in SC|017569.1.1729
Sequence length != multiple of 3 for SC|017569.1.1868!
In-frame STOP codon in SC|017569.1.1868 at position 195:198
Sequence length != multiple of 3 for SC|017569.1.1869!
In-frame STOP codon in SC|017569.1.1869 at position 117:120
Sequence length != multiple of 3 for SC|017569.1.1958.1!
Invalid codon G in SC|017569.1.1958.1
Sequence length != multiple of 3 for SC|017569.1.1983!
Invalid codon AG in SC|017569.1.1983
Sequence length != multiple of 3 for SC|017566.1.63!
Invalid codon G in SC|017566.1.63
Sequence length != multiple of 3 for SC|017566.1.179!
Invalid codon T in SC|017566.1.179
Sequence length != multiple of 3 for SC|017566.1.180!
In-frame STOP codon in SC|017566.1.180 at position 3:6
Sequence length != multiple of 3 for SC|017566.1.300!
Invalid codon AA in SC|017566.1.300
Sequence length != multiple of 3 for SC|017566.1.351!
Invalid codon G in SC|017566.1.351
Sequence length != multiple of 3 for SC|017566.1.360!
Invalid codon AA in SC|017566.1.360
Sequence length != multiple of 3 for SC|017566.1.460!
Invalid codon A in SC|017566.1.460
Sequence length != multiple of 3 for SC|017566.1.606!
In-frame STOP codon in SC|017566.1.606 at position 12:15
Sequence length != multiple of 3 for SC|017566.1.660!
Invalid codon AG in SC|017566.1.660
Sequence length != multiple of 3 for SC|017566.1.750!
Invalid codon AG in SC|017566.1.750
Sequence length != multiple of 3 for SC|017566.1.759!
Invalid codon G in SC|017566.1.759
Sequence length != multiple of 3 for SC|017566.1.779!
In-frame STOP codon in SC|017566.1.779 at position 18:21
Sequence length != multiple of 3 for SC|017566.1.898!
Invalid codon GG in SC|017566.1.898
Sequence length != multiple of 3 for SC|017566.1.903!
In-frame STOP codon in SC|017566.1.903 at position 24:27
Sequence length != multiple of 3 for SC|017566.1.938.1!
Invalid codon AT in SC|017566.1.938.1
Sequence length != multiple of 3 for SC|017566.1.1012!
In-frame STOP codon in SC|017566.1.1012 at position 18:21
Sequence length != multiple of 3 for SC|017566.1.1093!
Invalid codon G in SC|017566.1.1093
Sequence length != multiple of 3 for SC|017566.1.1209!
Invalid codon A in SC|017566.1.1209
Sequence length != multiple of 3 for SC|017566.1.1233!
Invalid codon TG in SC|017566.1.1233
Sequence length != multiple of 3 for SC|017566.1.1290!
In-frame STOP codon in SC|017566.1.1290 at position 546:549
Sequence length != multiple of 3 for SC|017566.1.1367!
Invalid codon AG in SC|017566.1.1367
Sequence length != multiple of 3 for SC|017566.1.1405!
In-frame STOP codon in SC|017566.1.1405 at position 3:6
Sequence length != multiple of 3 for SC|017566.1.1419.1!
Invalid codon GT in SC|017566.1.1419.1
Sequence length != multiple of 3 for SC|017566.1.1611!
Invalid codon CG in SC|017566.1.1611
Sequence length != multiple of 3 for SC|017566.1.1715!
In-frame STOP codon in SC|017566.1.1715 at position 24:27
Sequence length != multiple of 3 for SC|017566.1.1772!
Invalid codon TG in SC|017566.1.1772
Sequence length != multiple of 3 for SC|017566.1.1797!
Invalid codon AA in SC|017566.1.1797
Sequence length != multiple of 3 for SC|017566.1.1798!
In-frame STOP codon in SC|017566.1.1798 at position 3:6
Sequence length != multiple of 3 for SC|017566.1.1824!
Invalid codon C in SC|017566.1.1824
Sequence length != multiple of 3 for SC|017566.1.1840!
Invalid codon G in SC|017566.1.1840
Sequence length != multiple of 3 for SC|017573.1.33!
Invalid codon AG in SC|017573.1.33
Sequence length != multiple of 3 for SC|017573.1.140!
Invalid codon AG in SC|017573.1.140
Sequence length != multiple of 3 for SC|017573.1.210!
In-frame STOP codon in SC|017573.1.210 at position 111:114
Sequence length != multiple of 3 for SC|017573.1.216!
Invalid codon A in SC|017573.1.216
Sequence length != multiple of 3 for SC|017573.1.234!
Invalid codon TG in SC|017573.1.234
Sequence length != multiple of 3 for SC|017573.1.272!
Invalid codon A in SC|017573.1.272
Sequence length != multiple of 3 for SC|017573.1.314!
Invalid codon G in SC|017573.1.314
Sequence length != multiple of 3 for SC|017573.1.448!
Invalid codon G in SC|017573.1.448
Sequence length != multiple of 3 for SC|017573.1.450!
In-frame STOP codon in SC|017573.1.450 at position 0:3
Sequence length != multiple of 3 for SC|017573.1.613!
Invalid codon G in SC|017573.1.613
Sequence length != multiple of 3 for SC|017573.1.620!
Invalid codon AG in SC|017573.1.620
Sequence length != multiple of 3 for SC|017573.1.662!
Invalid codon AG in SC|017573.1.662
Sequence length != multiple of 3 for SC|017573.1.896.2!
Invalid codon CT in SC|017573.1.896.2
Sequence length != multiple of 3 for SC|017573.1.985!
In-frame STOP codon in SC|017573.1.985 at position 6:9
Sequence length != multiple of 3 for SC|017573.1.1025!
In-frame STOP codon in SC|017573.1.1025 at position 36:39
Sequence length != multiple of 3 for SC|017559.1.68!
Invalid codon G in SC|017559.1.68
Sequence length != multiple of 3 for SC|017559.1.92!
In-frame STOP codon in SC|017559.1.92 at position 6:9
Sequence length != multiple of 3 for SC|017559.1.131!
Invalid codon TT in SC|017559.1.131
Sequence length != multiple of 3 for SC|017559.1.274!
In-frame STOP codon in SC|017559.1.274 at position 231:234
Sequence length != multiple of 3 for SC|017559.1.442.1!
Invalid codon A in SC|017559.1.442.1
Sequence length != multiple of 3 for SC|017559.1.459!
Invalid codon GA in SC|017559.1.459
Sequence length != multiple of 3 for SC|017559.1.538!
In-frame STOP codon in SC|017559.1.538 at position 0:3
In-frame STOP codon in SC|017559.1.569 at position 0:3
Sequence length != multiple of 3 for SC|017559.1.604!
Invalid codon G in SC|017559.1.604
Sequence length != multiple of 3 for SC|017559.1.871.1!
Invalid codon G in SC|017559.1.871.1
Sequence length != multiple of 3 for SC|017559.1.980!
In-frame STOP codon in SC|017559.1.980 at position 54:57
Sequence length != multiple of 3 for SC|017559.1.1025!
Invalid codon GT in SC|017559.1.1025
Sequence length != multiple of 3 for SC|017559.1.1102!
In-frame STOP codon in SC|017559.1.1102 at position 0:3
Sequence length != multiple of 3 for SC|017559.1.1319!
In-frame STOP codon in SC|017559.1.1319 at position 0:3
Sequence length != multiple of 3 for SC|017559.1.1349.2!
Invalid codon GT in SC|017559.1.1349.2
Sequence length != multiple of 3 for SC|017559.1.1374!
In-frame STOP codon in SC|017559.1.1374 at position 225:228
Sequence length != multiple of 3 for SC|017559.1.1380!
Invalid codon AG in SC|017559.1.1380
Sequence length != multiple of 3 for SC|017559.1.1427.1!
Invalid codon TT in SC|017559.1.1427.1
Sequence length != multiple of 3 for SC|017559.1.1615!
Invalid codon A in SC|017559.1.1615
Sequence length != multiple of 3 for SC|017559.1.1789!
In-frame STOP codon in SC|017559.1.1789 at position 6:9
Sequence length != multiple of 3 for SC|017559.1.1804!
In-frame STOP codon in SC|017559.1.1804 at position 120:123
Sequence length != multiple of 3 for SC|017559.1.1915!
Invalid codon AA in SC|017559.1.1915
Sequence length != multiple of 3 for SC|017559.1.1961!
In-frame STOP codon in SC|017559.1.1961 at position 1125:1128
Sequence length != multiple of 3 for SC|017559.1.2122!
Invalid codon GA in SC|017559.1.2122
Sequence length != multiple of 3 for SC|017559.1.2167!
Invalid codon G in SC|017559.1.2167
Sequence length != multiple of 3 for SC|017557.1.39!
In-frame STOP codon in SC|017557.1.39 at position 69:72
Sequence length != multiple of 3 for SC|017557.1.190!
Invalid codon G in SC|017557.1.190
Sequence length != multiple of 3 for SC|017557.1.330!
In-frame STOP codon in SC|017557.1.330 at position 69:72
Sequence length != multiple of 3 for SC|017557.1.333!
Invalid codon G in SC|017557.1.333
Sequence length != multiple of 3 for SC|017557.1.334!
In-frame STOP codon in SC|017557.1.334 at position 12:15
Sequence length != multiple of 3 for SC|017557.1.347!
Invalid codon AG in SC|017557.1.347
Sequence length != multiple of 3 for SC|017557.1.393!
In-frame STOP codon in SC|017557.1.393 at position 108:111
Sequence length != multiple of 3 for SC|017557.1.430!
In-frame STOP codon in SC|017557.1.430 at position 24:27
Sequence length != multiple of 3 for SC|017557.1.485!
In-frame STOP codon in SC|017557.1.485 at position 78:81
Sequence length != multiple of 3 for SC|017557.1.525!
In-frame STOP codon in SC|017557.1.525 at position 33:36
Sequence length != multiple of 3 for SC|017557.1.570!
Invalid codon G in SC|017557.1.570
Sequence length != multiple of 3 for SC|017557.1.620!
Invalid codon AG in SC|017557.1.620
Sequence length != multiple of 3 for SC|017557.1.630!
In-frame STOP codon in SC|017557.1.630 at position 15:18
Sequence length != multiple of 3 for SC|017557.1.718!
Invalid codon GA in SC|017557.1.718
Sequence length != multiple of 3 for SC|017557.1.849!
Invalid codon CG in SC|017557.1.849
Sequence length != multiple of 3 for SC|017557.1.850!
Invalid codon TT in SC|017557.1.850
Sequence length != multiple of 3 for SC|017557.1.977!
Invalid codon T in SC|017557.1.977
Sequence length != multiple of 3 for SC|017557.1.1228!
In-frame STOP codon in SC|017557.1.1228 at position 18:21
Sequence length != multiple of 3 for SC|017557.1.1258!
Invalid codon GA in SC|017557.1.1258
Sequence length != multiple of 3 for SC|017557.1.1260!
In-frame STOP codon in SC|017557.1.1260 at position 6:9
Sequence length != multiple of 3 for SC|017557.1.1340!
In-frame STOP codon in SC|017557.1.1340 at position 27:30
Sequence length != multiple of 3 for SC|017557.1.1379!
Invalid codon AA in SC|017557.1.1379
Sequence length != multiple of 3 for SC|017557.1.1389!
Invalid codon G in SC|017557.1.1389
Sequence length != multiple of 3 for SC|017557.1.1478!
Invalid codon AA in SC|017557.1.1478
Sequence length != multiple of 3 for SC|017557.1.1505!
Invalid codon TT in SC|017557.1.1505
Sequence length != multiple of 3 for SC|017557.1.1537!
Invalid codon TG in SC|017557.1.1537
Sequence length != multiple of 3 for SC|017557.1.1615!
Invalid codon G in SC|017557.1.1615
Sequence length != multiple of 3 for SC|017557.1.1734!
In-frame STOP codon in SC|017557.1.1734 at position 27:30
Sequence length != multiple of 3 for SC|017557.1.1799!
In-frame STOP codon in SC|017557.1.1799 at position 0:3
Sequence length != multiple of 3 for SC|017557.1.1860!
In-frame STOP codon in SC|017557.1.1860 at position 27:30
Sequence length != multiple of 3 for SC|017557.1.2077!
In-frame STOP codon in SC|017557.1.2077 at position 42:45
Sequence length != multiple of 3 for SC|017557.1.2136!
Invalid codon G in SC|017557.1.2136
Sequence length != multiple of 3 for SC|017557.1.2150!
Invalid codon G in SC|017557.1.2150
Sequence length != multiple of 3 for SC|017557.1.2255!
Invalid codon AG in SC|017557.1.2255
Sequence length != multiple of 3 for SC|017557.1.2497!
Invalid codon AG in SC|017557.1.2497
Sequence length != multiple of 3 for SC|017557.1.2544!
Invalid codon G in SC|017557.1.2544
Sequence length != multiple of 3 for SC|017557.1.2729!
Invalid codon G in SC|017557.1.2729
Sequence length != multiple of 3 for SC|017557.1.2789!
In-frame STOP codon in SC|017557.1.2789 at position 57:60
Sequence length != multiple of 3 for SC|017557.1.2790!
In-frame STOP codon in SC|017557.1.2790 at position 21:24
Sequence length != multiple of 3 for SC|017557.1.2880!
Invalid codon C in SC|017557.1.2880
Sequence length != multiple of 3 for SC|017557.1.2931!
Invalid codon T in SC|017557.1.2931
Sequence length != multiple of 3 for SC|017557.1.2951.1!
Invalid codon CT in SC|017557.1.2951.1
Sequence length != multiple of 3 for SC|017555.1.148!
Invalid codon AG in SC|017555.1.148
Sequence length != multiple of 3 for SC|017555.1.169!
Invalid codon CA in SC|017555.1.169
Sequence length != multiple of 3 for SC|017555.1.243!
In-frame STOP codon in SC|017555.1.243 at position 108:111
Sequence length != multiple of 3 for SC|017555.1.326!
In-frame STOP codon in SC|017555.1.326 at position 0:3
Sequence length != multiple of 3 for SC|017555.1.370!
In-frame STOP codon in SC|017555.1.370 at position 18:21
Sequence length != multiple of 3 for SC|017555.1.505!
Invalid codon AA in SC|017555.1.505
Sequence length != multiple of 3 for SC|017555.1.605!
Invalid codon T in SC|017555.1.605
Sequence length != multiple of 3 for SC|017555.1.613!
Invalid codon CT in SC|017555.1.613
Sequence length != multiple of 3 for SC|017555.1.704!
In-frame STOP codon in SC|017555.1.704 at position 0:3
Sequence length != multiple of 3 for SC|017555.1.787!
Invalid codon CA in SC|017555.1.787
Sequence length != multiple of 3 for SC|017555.1.993!
Invalid codon G in SC|017555.1.993
Sequence length != multiple of 3 for SC|017555.1.996!
In-frame STOP codon in SC|017555.1.996 at position 6:9
Sequence length != multiple of 3 for SC|017555.1.1013!
Invalid codon AA in SC|017555.1.1013
Sequence length != multiple of 3 for SC|017555.1.1039!
Invalid codon AG in SC|017555.1.1039
Sequence length != multiple of 3 for SC|017555.1.1172!
In-frame STOP codon in SC|017555.1.1172 at position 156:159
Sequence length != multiple of 3 for SC|017555.1.1185!
In-frame STOP codon in SC|017555.1.1185 at position 0:3
Sequence length != multiple of 3 for SC|017555.1.1220!
Invalid codon G in SC|017555.1.1220
Sequence length != multiple of 3 for SC|017555.1.1296!
In-frame STOP codon in SC|017555.1.1296 at position 72:75
Sequence length != multiple of 3 for SC|017555.1.1525!
Invalid codon AG in SC|017555.1.1525
Sequence length != multiple of 3 for SC|017555.1.1577!
Invalid codon G in SC|017555.1.1577
Sequence length != multiple of 3 for SC|017555.1.1595!
Invalid codon C in SC|017555.1.1595
Sequence length != multiple of 3 for SC|017555.1.1606!
In-frame STOP codon in SC|017555.1.1606 at position 3:6
Sequence length != multiple of 3 for SC|017555.1.1609!
In-frame STOP codon in SC|017555.1.1609 at position 39:42
Sequence length != multiple of 3 for SC|017555.1.1610!
In-frame STOP codon in SC|017555.1.1610 at position 42:45
Sequence length != multiple of 3 for SC|017555.1.1875!
In-frame STOP codon in SC|017555.1.1875 at position 3:6
Sequence length != multiple of 3 for SC|017555.1.1897!
In-frame STOP codon in SC|017555.1.1897 at position 12:15
Sequence length != multiple of 3 for SC|017555.1.1918.7!
Invalid codon TA in SC|017555.1.1918.7
Sequence length != multiple of 3 for SC|017555.1.1990!
In-frame STOP codon in SC|017555.1.1990 at position 0:3
Sequence length != multiple of 3 for SC|017555.1.2024!
In-frame STOP codon in SC|017555.1.2024 at position 27:30
Sequence length != multiple of 3 for SC|017555.1.2025!
In-frame STOP codon in SC|017555.1.2025 at position 102:105
Sequence length != multiple of 3 for SC|017555.1.2117!
Invalid codon AG in SC|017555.1.2117
Sequence length != multiple of 3 for SC|017555.1.2151!
In-frame STOP codon in SC|017555.1.2151 at position 69:72
Sequence length != multiple of 3 for SC|017555.1.2187!
Invalid codon G in SC|017555.1.2187
Sequence length != multiple of 3 for SC|017555.1.2287!
Invalid codon G in SC|017555.1.2287
Sequence length != multiple of 3 for SC|017555.1.2488!
Invalid codon G in SC|017555.1.2488
Sequence length != multiple of 3 for SC|017555.1.2598!
Invalid codon G in SC|017555.1.2598
Sequence length != multiple of 3 for SC|017555.1.2663.1!
Invalid codon AC in SC|017555.1.2663.1
Sequence length != multiple of 3 for SC|017555.1.2706!
In-frame STOP codon in SC|017555.1.2706 at position 15:18
Sequence length != multiple of 3 for SC|017555.1.2728!
Invalid codon C in SC|017555.1.2728
Sequence length != multiple of 3 for SC|017555.1.2843!
Invalid codon GG in SC|017555.1.2843
Sequence length != multiple of 3 for SC|017555.1.2881!
In-frame STOP codon in SC|017555.1.2881 at position 162:165
Sequence length != multiple of 3 for SC|017555.1.2885!
Invalid codon T in SC|017555.1.2885
Sequence length != multiple of 3 for SC|017555.1.2908!
In-frame STOP codon in SC|017555.1.2908 at position 42:45
Sequence length != multiple of 3 for SC|017555.1.2949!
Invalid codon GA in SC|017555.1.2949
Sequence length != multiple of 3 for SC|017555.1.3047!
Invalid codon AG in SC|017555.1.3047
Sequence length != multiple of 3 for SC|017555.1.3048!
Invalid codon C in SC|017555.1.3048
Sequence length != multiple of 3 for SC|017555.1.3050!
Invalid codon T in SC|017555.1.3050
Sequence length != multiple of 3 for SC|017555.1.3141!
Invalid codon GT in SC|017555.1.3141
Sequence length != multiple of 3 for SC|017555.1.3222!
In-frame STOP codon in SC|017555.1.3222 at position 9:12
Sequence length != multiple of 3 for SC|017555.1.3251!
Invalid codon AA in SC|017555.1.3251
Sequence length != multiple of 3 for SC|017555.1.3302!
In-frame STOP codon in SC|017555.1.3302 at position 0:3
Sequence length != multiple of 3 for SC|017570.1.12!
Invalid codon A in SC|017570.1.12
Sequence length != multiple of 3 for SC|017570.1.154!
In-frame STOP codon in SC|017570.1.154 at position 6:9
Sequence length != multiple of 3 for SC|017570.1.280!
Invalid codon G in SC|017570.1.280
Sequence length != multiple of 3 for SC|017570.1.352!
Invalid codon TG in SC|017570.1.352
Sequence length != multiple of 3 for SC|017570.1.400!
Invalid codon AG in SC|017570.1.400
Sequence length != multiple of 3 for SC|017570.1.480!
Invalid codon G in SC|017570.1.480
Sequence length != multiple of 3 for SC|017570.1.483!
Invalid codon T in SC|017570.1.483
Sequence length != multiple of 3 for SC|017570.1.491!
Invalid codon AT in SC|017570.1.491
Sequence length != multiple of 3 for SC|017570.1.512!
Invalid codon AT in SC|017570.1.512
Sequence length != multiple of 3 for SC|017570.1.523!
In-frame STOP codon in SC|017570.1.523 at position 9:12
Sequence length != multiple of 3 for SC|017570.1.536!
Invalid codon A in SC|017570.1.536
Sequence length != multiple of 3 for SC|017570.1.580!
Invalid codon CA in SC|017570.1.580
Sequence length != multiple of 3 for SC|017570.1.586!
Invalid codon C in SC|017570.1.586
Sequence length != multiple of 3 for SC|017570.1.587!
In-frame STOP codon in SC|017570.1.587 at position 24:27
Sequence length != multiple of 3 for SC|017570.1.591!
In-frame STOP codon in SC|017570.1.591 at position 51:54
Sequence length != multiple of 3 for SC|017570.1.596!
Invalid codon A in SC|017570.1.596
Sequence length != multiple of 3 for SC|017570.1.707!
In-frame STOP codon in SC|017570.1.707 at position 42:45
Sequence length != multiple of 3 for SC|017570.1.715!
Invalid codon G in SC|017570.1.715
Sequence length != multiple of 3 for SC|017570.1.720!
In-frame STOP codon in SC|017570.1.720 at position 6:9
Sequence length != multiple of 3 for SC|017570.1.725!
In-frame STOP codon in SC|017570.1.725 at position 33:36
Sequence length != multiple of 3 for SC|017570.1.733!
Invalid codon AG in SC|017570.1.733
Sequence length != multiple of 3 for SC|017570.1.826!
Invalid codon G in SC|017570.1.826
Sequence length != multiple of 3 for SC|017570.1.879!
In-frame STOP codon in SC|017570.1.879 at position 21:24
Sequence length != multiple of 3 for SC|017570.1.909!
In-frame STOP codon in SC|017570.1.909 at position 24:27
Sequence length != multiple of 3 for SC|017570.1.1039!
Invalid codon AG in SC|017570.1.1039
Sequence length != multiple of 3 for SC|017570.1.1051!
Invalid codon A in SC|017570.1.1051
Sequence length != multiple of 3 for SC|017570.1.1081!
Invalid codon G in SC|017570.1.1081
Sequence length != multiple of 3 for SC|017570.1.1166!
Invalid codon GT in SC|017570.1.1166
Sequence length != multiple of 3 for SC|017570.1.1169!
In-frame STOP codon in SC|017570.1.1169 at position 21:24
Sequence length != multiple of 3 for SC|017570.1.1186!
Invalid codon T in SC|017570.1.1186
Sequence length != multiple of 3 for SC|017570.1.1265!
Invalid codon T in SC|017570.1.1265
Sequence length != multiple of 3 for SC|017570.1.1341!
Invalid codon A in SC|017570.1.1341
Sequence length != multiple of 3 for SC|017570.1.1441!
In-frame STOP codon in SC|017570.1.1441 at position 69:72
Sequence length != multiple of 3 for SC|017570.1.1539!
Invalid codon GG in SC|017570.1.1539
Sequence length != multiple of 3 for SC|017570.1.1634!
Invalid codon G in SC|017570.1.1634
Sequence length != multiple of 3 for SC|017570.1.1682!
In-frame STOP codon in SC|017570.1.1682 at position 66:69
Sequence length != multiple of 3 for SC|017570.1.1692!
In-frame STOP codon in SC|017570.1.1692 at position 3:6
Sequence length != multiple of 3 for SC|017570.1.1752!
Invalid codon GG in SC|017570.1.1752
Sequence length != multiple of 3 for SC|017570.1.1753!
Invalid codon G in SC|017570.1.1753
Sequence length != multiple of 3 for SC|017570.1.1754!
In-frame STOP codon in SC|017570.1.1754 at position 150:153
Sequence length != multiple of 3 for SC|017570.1.1810!
Invalid codon GA in SC|017570.1.1810

The errors seem to be about sequence lengths not being a multiple of three. I still get the output file “Sc.ksd”, but the SVG result is still blank. My commands are below, and the input and output files are attached.

singularity exec wgd.sif wgd dmd -e 1e-10 --nostrictcds -o Sc.dmd input.cds.fasta

singularity exec wgd.sif wgd ksd --n_threads 10 -mp 1000 -o Sc.ksd Sc.dmd/Sc.relab.cds.fasta.mcl
test.zip
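
For anyone hitting the same warnings, a quick way to see which input sequences are affected is to screen the CDS file before running `wgd dmd`. The snippet below is a minimal sketch, not wgd's own check: `check_cds` is a hypothetical helper, it assumes the standard stop codons (TAA, TAG, TGA) and treats anything other than A/C/G/T as invalid, and it only mirrors the three message types seen in the log above.

```python
# Hypothetical pre-check, not part of wgd: mirrors the three warning types
# in the log (length not a multiple of 3, invalid codon, in-frame stop).
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code
VALID_BASES = set("ACGT")            # assumption: ambiguity codes count as invalid


def check_cds(seq):
    """Return a list of problems found in one coding sequence."""
    seq = seq.upper()
    problems = []
    if len(seq) % 3 != 0:
        problems.append("length not a multiple of 3")
    # Walk codon by codon; the last chunk is shorter than 3 when the length is off.
    for start in range(0, len(seq), 3):
        codon = seq[start:start + 3]
        end = start + len(codon)
        is_last = start + 3 >= len(seq)
        if len(codon) < 3 or not set(codon) <= VALID_BASES:
            problems.append(f"invalid codon {codon} at {start}:{end}")
        elif codon in STOP_CODONS and not is_last:
            # A stop codon is only a problem when it is not the final codon.
            problems.append(f"in-frame stop codon at {start}:{end}")
    return problems


if __name__ == "__main__":
    for name, seq in [("ok", "ATGAAATAA"), ("bad", "ATGTAAGGAG")]:
        print(name, check_cds(seq))
```

Sequences that come back with a non-empty list are the ones likely to trigger the messages above, so they can be inspected or removed before rerunning.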

Unable to install wgd

hi

I have tried installing wgd from source.

But it failed due to a numpy error. I am unable to install the older version numpy==1.19.0.

Is there any other way to install it with the latest version of numpy?

inference of Ks value from paranome

Hi
I have run the whole-paranome Ks distribution analysis and obtained the plot through the "viz" step with the ELMM mixture model. But I am confused by the different values associated with the different lognormal optimizations: which one should be taken as the final Ks value from the plot? Kindly help.

PFA
Ks_inference.pdf

wgd dmd

Hi,
I am trying to analyse a single species using wgd v2. In the species delineation step, I am not sure how to generate the families file that I need in the Ks step. I tried all relevant wgd dmd options, but none of them writes a single "families" file. I am sure I am missing something here. I'd really appreciate your answers.

best,

--globalmrbh

Hi, I tried wgd dmd --globalmrbh on my 5 individual species and got an error like this:

Screenshot 2024-06-05 at 14 58 09

My commands are like this:

wgd dmd --globalmrbh 20628_8.cds.fasta 20628_9.cds.fasta 20628_1.cds.fasta 20896_1.cds.fasta 20896_2.cds.fasta -o global

Any suggestion for this error? Thanks for your time.

wgd sync

Hi,
This question was solved.
Thanks

Error for ksd

When I run the ksd step, the following error appears and all codeml files are empty. How can I solve it?

WARNING No codeml result for GF00003411 due to no resolved nucleotides codeml.py:234
INFO Analysing family GF00003415 core.py:2873
WARNING No codeml result for GF00003412 due to no resolved nucleotides codeml.py:234
INFO Analysing family GF00003416 core.py:2873
WARNING No codeml result for GF00003413 due to no resolved nucleotides codeml.py:234
INFO Analysing family GF00003417 core.py:2873
WARNING No codeml result for GF00003414 due to no resolved nucleotides codeml.py:234
WARNING No codeml result for GF00003415 due to no resolved nucleotides codeml.py:234

Error with numpy==1.24.4

Hi Hengchi and thank you for improving the tool!

I installed wgd v2.0.22 into a brand new venv and got an error upon running wgd viz:


Traceback (most recent call last):
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd_2.0.22/bin/wgd", line 11, in <module>
    load_entry_point('wgd', 'console_scripts', 'wgd')()
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd_2.0.22/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd_2.0.22/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd_2.0.22/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd_2.0.22/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd_2.0.22/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/cli.py", line 557, in viz
    _viz(**kwargs)
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/cli.py", line 560, in _viz
    from wgd.viz import elmm_plot, apply_filters, multi_sp_plot, default_plot,all_dotplots,filter_by_minlength,dotplotunitgene,dotplotingene,filter_mingenumber,dotplotingeneoverall
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd/viz.py", line 20, in <module>
    from sklearn import mixture
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd_2.0.22/lib/python3.8/site-packages/sklearn/mixture/__init__.py", line 5, in <module>
    from ._gaussian_mixture import GaussianMixture
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd_2.0.22/lib/python3.8/site-packages/sklearn/mixture/_gaussian_mixture.py", line 11, in <module>
    from ._base import BaseMixture, _check_shape
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd_2.0.22/lib/python3.8/site-packages/sklearn/mixture/_base.py", line 14, in <module>
    from .. import cluster
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd_2.0.22/lib/python3.8/site-packages/sklearn/cluster/__init__.py", line 6, in <module>
    from ._spectral import spectral_clustering, SpectralClustering
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd_2.0.22/lib/python3.8/site-packages/sklearn/cluster/_spectral.py", line 16, in <module>
    from ..neighbors import kneighbors_graph, NearestNeighbors
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd_2.0.22/lib/python3.8/site-packages/sklearn/neighbors/__init__.py", line 17, in <module>
    from ._nca import NeighborhoodComponentsAnalysis
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd_2.0.22/lib/python3.8/site-packages/sklearn/neighbors/_nca.py", line 22, in <module>
    from ..decomposition import PCA
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd_2.0.22/lib/python3.8/site-packages/sklearn/decomposition/__init__.py", line 17, in <module>
    from .dict_learning import dict_learning
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd_2.0.22/lib/python3.8/site-packages/sklearn/decomposition/dict_learning.py", line 5, in <module>
    from . import _dict_learning  # type: ignore
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd_2.0.22/lib/python3.8/site-packages/sklearn/decomposition/_dict_learning.py", line 21, in <module>
    from ..linear_model import Lasso, orthogonal_mp_gram, LassoLars, Lars
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd_2.0.22/lib/python3.8/site-packages/sklearn/linear_model/__init__.py", line 11, in <module>
    from ._least_angle import (Lars, LassoLars, lars_path, lars_path_gram, LarsCV,
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd_2.0.22/lib/python3.8/site-packages/sklearn/linear_model/_least_angle.py", line 34, in <module>
    method='lar', copy_X=True, eps=np.finfo(np.float).eps,
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd_2.0.22/lib/python3.8/site-packages/numpy/__init__.py", line 305, in __getattr__
    raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'float'.
`np.float` was a deprecated alias for the builtin `float`. To avoid this error in existing code, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

So it seems that following the installation instructions pulls in the newest version of numpy:

$ pip freeze | grep numpy
numpy==1.24.4

Installing numpy==1.22 into the venv fixed the issue for me. Maybe it's worth adding a numpy version constraint to the installation files?

Best, Nikita
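
To make the version boundary concrete: the error message above says the `np.float` alias was deprecated in NumPy 1.20, and as far as I know it was removed in 1.24, which matches 1.22 working and 1.24.4 failing here. The snippet below is only a sketch for checking a version string against that boundary; `np_float_alias_removed` is a hypothetical helper and not part of wgd, numpy, or scikit-learn.

```python
def np_float_alias_removed(version):
    """Return True if this NumPy version no longer provides the `np.float` alias.

    Assumes the alias was removed in NumPy 1.24 (deprecated since 1.20).
    """
    # Only major and minor matter for this check, e.g. "1.24.4" -> (1, 24).
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= (1, 24)


if __name__ == "__main__":
    for version in ("1.19.0", "1.22.0", "1.24.4"):
        print(version, np_float_alias_removed(version))
```

With a check like this, any environment where an old scikit-learn still calls `np.float` would need a numpy version for which the function returns False.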

`wgd viz` not running with `--plotsyn`

Running the following in wgd 2.0.22:

wgd viz -d wgd_ksd/global_MRBH.tsv.ks.tsv \
    -epk species1_ksd/species1.fa.tsv.ks.tsv -sp speciestree.txt -rw -ap species1_syn/iadhore-out/anchorpoints.txt  \
    -sr "species1.fa;species2.fa" -sr "species1.fa;species3.fa" -sr "species1.fa;species4.fa" -sr "species1.fa;species1.fa" \
    -gs wgd_ksd/gene_species.map --plotkde --plotelmm --plotsyn

I get the following error as if something was wrong with the command:


14:11:37 INFO     This is wgd v2.0.22                                                                                                               cli.py:32
Traceback (most recent call last):
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd_2.0.22/bin/wgd", line 11, in <module>
    load_entry_point('wgd', 'console_scripts', 'wgd')()
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd_2.0.22/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd_2.0.22/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd_2.0.22/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd_2.0.22/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd_2.0.22/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/cli.py", line 557, in viz
    _viz(**kwargs)
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/cli.py", line 572, in _viz
    table = pd.read_csv(genetable,header=0,index_col=0,sep=',')
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd_2.0.22/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd_2.0.22/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 678, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd_2.0.22/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 575, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd_2.0.22/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 932, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd_2.0.22/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1216, in _make_engine
    self.handles = get_handle(  # type: ignore[call-overload]
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd_2.0.22/lib/python3.8/site-packages/pandas/io/common.py", line 667, in get_handle
    ioargs = _get_filepath_or_buffer(
  File "/netscratch/dep_mercier/grp_novikova/software/wgd/wgd_2.0.22/lib/python3.8/site-packages/pandas/io/common.py", line 424, in _get_filepath_or_buffer
    raise ValueError(msg)
ValueError: Invalid file path or buffer object type: 

Removing `--plotsyn` from the command makes the error go away.
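
Judging only from the traceback, `genetable` reaches `pd.read_csv` in `_viz` as something pandas does not accept as a path or buffer. That would happen if the input behind `genetable` was never supplied on the command line, but this is an assumption and not confirmed here. The sketch below shows the kind of early check that would turn this into a readable message. The helper `require_path` and its `option_name` argument are hypothetical and not part of wgd.

```python
import os


def require_path(path, option_name):
    """Fail early with a clear message instead of handing a bad value to a reader.

    Hypothetical helper: `option_name` is only used to build the message.
    """
    if path is None:
        raise ValueError(f"{option_name} is needed for this plot but was not given")
    if not os.path.isfile(path):
        raise FileNotFoundError(f"{option_name} points to a missing file: {path}")
    return path


# What a caller would see when the input is absent:
try:
    require_path(None, "gene table")
except ValueError as err:
    print(err)
```

If this reading is right, a check like this before the `pd.read_csv` call would point to the missing input directly rather than surfacing a pandas error.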
