Giter Site home page Giter Site logo

supermatrix's Introduction

supermatrix

scripts to add new proteins to an existing alignment using hmms, and simple diagnostics/visualizations

The software here is intended to extend existing supermatrix alignments with new taxa, although a number of general purpose programs are also given here. If something does not work, please email me.

Jump to:

Please also note that some similar diagnostic/manipulation tools exist in other packages by various people, including AMAS, SCaFoS and BuddySuite

Note that most Python scripts below require BioPython.

add_taxa_to_align

Script to add new taxa to an existing protein supermatrix alignment. Proteins from new taxa are untrimmed, though a trimming step may be implemented. Input alignments could be in a number of formats, as a supermatrix with a separate partition file, or individual alignments as separate files.

  • Multiple new taxa can be added one of two ways, with -t, as space-separate list of protein files (could be gene models or translated from transcriptomes) with -T as species names. Fasta files from -t and species names from -T must be in the same order. Alternatively, a tabular input file may be used (with -X) to specify both fasta file and species name for each new species.
  • By default, only the single best hit is taken (changed with -m), and is renamed to the corresponding species given in -T.
  • Several folders with many intermediate files are generated (-d, -E, -I, and -S), in case it is needed to re-examine later. These are automatically named with each run (so they cannot overwrite each other), and probably do not need to be changed.
  • For alignment format (-f), most cases phylip format is actually phylip-relaxed.
  • By default, the e-value cutoff is determined uniquely for each gene based on the lower limit of that hmm against the original gene set. This is to reduce the chance of finding out-paralogs. However, a static e-value cutoff for hmmsearch can be given using --ev-threshold, though this is not advised. See below for an explanation.
  • To generate the new supermatrix with the added taxa, specify the name of the new file with -U. A new partition file will be automatically generated if -U is specified.

For example, to specify new species with -t and -T:

add_taxa_to_align.py -a philippe2009_FullAlignment.phy -i philippe2009_partitions.txt -t ~/genomes/apis_mellifera/amel_OGSv3.2_pep.fa -T Apis_mellifera -f phylip-relaxed -U philippe2009_w_amel.aln

To instead use a tabular input to specify multiple input files and species, specify the text file with -X and do not use options -t and -T.

add_taxa_to_align.py -a philippe2009_FullAlignment.phy -i philippe2009_partitions.txt -X new_taxa.txt -f phylip-relaxed -U philippe2009_w_amel.aln

The tabular input file is a text file where each line has one fasta file and then the corresponding taxon name, separated by a tab.

~/genomes/apis_mellifera/amel_OGSv3.2_pep.fa	Apis_mellifera
~/genomes/trichinella_spiralis/t_spiralis.WS248.protein.fa	Trichinella_spiralis
...

Requires BioPython, hmmsearch and hmmbuild, and mafft v7.3.10, though could be modified to use any aligner. Older versions fo mafft are compatible if using the -r option (current script requires an option in v7.3), which skips the mafft alignment-trimming step.

Binaries are assumed to be in the user's PATH. This can be changed with the options --mafft, --hmmbin, and --fasttree. Both --mafft and --fasttree should point to binaries, but --hmmbin points to a folder containing both hmmsearch and hmmbuild.

add_taxa_to_align.py -a philippe2009_FullAlignment.phy -i philippe2009_partitions.txt -t ~/genomes/apis_mellifera/amel_OGSv3.2_pep.fa -T Apis_mellifera -f phylip-relaxed -U philippe2009_w_amel.aln --mafft ~/programs/mafft --hmmbin ~/programs/

All messages and reports can be captured as standard error (using 2>), such as 2> philippe2009_w_new_taxa.log. It is recommended to do this (possibly in verbose mode -v) as the log contains information as to why every hmmsearch hit was selected or rejected, and it may be necessary to manually correct some sequences.

join_alignments

Join multiple individual alignment files into a supermatrix, allowing for only one occurence of any taxa in each alignment. Names must be the same, though can have unique identifiers (like gene names or numbers) as long as they can be systematically split from the taxon names (using -d).

join_alignments.py -a hehenberger2017_alignments/* -d "@" -u hehenberger2017_supermatrix.fasta

This can also be used to rejoin alignments produced by add_taxa_to_align.py that are manually edited. Use the -A option to detect the order from the default output naming scheme of the alignments.

This will automatically generate a partition file for the supermatrix, adding .partition.txt to the name from -u.

split_supermatrix_to_genes

Split a supermatrix back into individual alignment files in fasta format, one for gene defined by the partition file. An optional output directory can be given with -d, otherwise files are placed in the present working directory. This is the reverse operation of join_alignments.py.

./split_supermatrix_to_genes.py -a simion2017_97sp_401632pos_1719genes.fasta.gz -d simion_genes -p simion2017_partitions.txt

split_supermatrix_to_taxa

Split a supermatrix into fasta files, one for each taxa where individual proteins are defined by the partition file. Empty proteins are ignored, but gaps are retained, meaning gaps may need to be removed later depending on the next step. This is NOT the reverse operation of join_alignments.py, which joins multiple alignment files into a supermatrix.

./split_supermatrix_to_taxa.py -a simion2017_97sp_401632pos_1719genes.fasta.gz -d simion_taxa -p simion2017_partitions.txt

check_supermatrix_alignments

Quick diagnostic script to check matrix occupancy. Adjust format accordingly based on the alignment using the -f option. As above, in most cases phylip format is probably phylip-relaxed.

check_supermatrix_alignments.py -a philippe2009_FullAlignment.phy -p philippe2009_partitions.txt -f phylip-relaxed

To generate a chart of matrix occupancy, add the -m option with the name of the new output file. This matrix can be visualized using the R script draw_matrix_occupancy.R.

check_supermatrix_alignments.py -a philippe2009_FullAlignment.phy -p philippe2009_partitions.txt -f phylip-relaxed -m philippe2009_occupancy_matrix.tab

To reorder the matrix based on taxa in a rooted tree, use the -T option with a nexus-format tree. For instance, open any tree in figtree, rotate branches, etc, then copy and paste the tree to a text file, and this is in nexus format. Then run draw_matrix_occupancy.R below.

check_supermatrix_alignments.py -p Metazoa_full_Models_short.txt -a Metazoa_full-fix.phy -f phylip-relaxed -m Metazoa_full_matrix.tab -T Metazoa_full.nex

Because Present is coded as 50-100% of the full length, this may hide taxa that have mostly partial sequences. To instead output the occupancy matrix as the percentage of the full length gene, use the --percent option. The R script draw_matrix_occupancy.R can be used on this matrix as well.

check_supermatrix_alignments.py -p Metazoa_full_Models_short.txt -a Metazoa_full-fix.phy -f phylip-relaxed --percent -m Metazoa_full_percent_matrix.tab -T Metazoa_full.nex

compare_supermatrix_alignments

Directly compare two output supermatrices, say from two different runs of add_taxa_to_align.py using slightly different parameters. This will show genes that missing in one or the other, or are different between the two, perhaps due to finding incorrect genes or different splice variants.

Note that partitions must be the same, meaning only vary by presence or absence.

filter_supermatrix

Filter supermatrices based on coverage for each gene (not removing individual sites). Minimum coverage is given by -c for values between 0 and 1. A new partition file is automatically generated based on the output name -o.

filter_supermatrix.py -a simion2017_97sp_401632pos_1719genes.fasta -c 0.75 -p simion2017_partitions.txt -o simion2017_97sp_75cov.fasta

draw_matrix_occupancy

A graph of matrix occupancy can be generated with the accompanied R script. Genes that are present are colored blue, partial genes are red, and absent genes are white, for each taxa. Taxa with 100% occupancy are colored green, while those with under 50% are colored purple. By default, the taxon order in the graph is the same order as given in the supermatrix alignment, which is then preserved in the matrix occupancy chart above. Taxa can be reordered arbitrarily in either file.

The script can be run in the terminal as:

Rscript draw_matrix_occupancy.R philippe2009_occupancy_matrix.tab

philippe2009_occupancy_matrix.png

coverage_by_site

Simple diagnostic for checking coverage by site, providing a histogram of frequency of number of gaps. Change format using -f. The script can accept gzipped files.

coverage_by_site.py -a alignments/philippe2009_FullAlignment.phy -f phylip-relaxed

Some results are summarized below. With the exception of the Borowiec 2015 study, which made use of entirely genomic data, none of them have a single site that is covered by all taxa. Even for the Borowiec study, this is only 180 sites out of 384k, so not even close to 1%.

Dataset Taxa Genes Occupancy % Total sites Max cov (gaps) Sites w/ max Avg gaps %
Dunn 2008 77 65 48.07 (2.6) 9918 70 (7) 21 42.60
Hejnol 2009 94 1486 17.61 (1.2) 270392 77 (17) 53 79.35
Philippe 2009 55 128 79.32 (2.6) 30257 54 (1) 91 14.79
Ryan 2013 EST 70 406 46.18 (4.1) 88384 58 (12) 52 41.52
Nosenko 2013 R 71 87 76.15 (2.2) 14612 63 (8) 105 19.24
Nosenko 2013 nR 71 35 70.14 (4.8) 9187 67 (4) 3 24.98
Whelan 2015 D10 70 210 68.10 (7.8) 59733 68 (2) 99 26.65
Whelan 2015 D16 70 87 71.48 (10.6) 23680 68 (2) 70 24.51
Borowiec 2015 36 1080 79.84 (6.7) 384981 36 (0) 180 8.69
Cannon 2016 78 212 80.25 (0.2) 44896 75 (3) 59 24.25
Simion 2017 97 1719 65.88 (8.2) 401632 95 (2) 28 38.07

test alignments and occupancy matrices

Some alignments can be found in the alignments folder, and PDFs of occupancy matrices can be found in the matrix folder. Datasets used are:

determination of evalues for each partition

For programs like BLAST or HMMSEARCH, the bitscore, and ultimately the E-value, is dependent on the length of the matched portion. Thus, a good match of a short protein will never have a bitscore as high as a good match for a long protein. For this reason, a static E-value cutoff is not suitable for identifying orthologs in new species.

Considering the chart below, each point represents a self-hit from the HMM profile against the original dataset that was used to make the profile. The longest proteins are dark blue (~700AAs) and the shortest ones are red (~100AAs), with a gradient in between. Proteins that are 80% of the length of the alignment are indicated by the dark circles.

philippe2009_w_coral_selfhits.png

It is immediately evident that the highest E-values belong to the longest proteins. Thus, it is clear that a single value cannot be used as a filter for all partitions in a supermatrix. However, even for a long protein, it is clear that many proteins in the original set have E-values substantially lower than the max. These are partial proteins that are kept in the matrix. Thus, the threshold for each partition must be determined primarily by the long proteins, not the lowest value.

A single threshold by E-value based on only long sequences would work, but this would never allow partial matches, as the target proteins would probably have to be the same length as most of the complete sequences. For this, an additional heuristic is used, based on the bitscore-per-length (BpL). This measurement can effectively sort out out-paralogs, but can also help to identify partial sequences. This is necessary because a closely related protein may be full length, and ultimately get a higher bitscore, than a real protein that is only partially complete (say in a transcriptome). Thus, any hits that have a higher BpL than the mean for that partition are kept anyway, even if they are too short, and this takes precedent over a longer hit with a much lower BpL.

supermatrix's People

Contributors

wrf avatar

Watchers

Tonny Kinene avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.