simonhmartin / twisst Goto Github PK

Topology weighting by iterative sampling of sub-trees

License: GNU General Public License v3.0

Python 70.23% R 29.77%

twisst's Introduction

Twisst: Topology weightying by iterative sampling of sub-trees

Topology weighting is a means to quantify relationships between taxa that are not necessarily monophyletic. It's a simple, descriptive method, designed for exploring how relationship vary across the genome using population genomic data.

The relationship among a given set of taxa can be defined by a number of possible topologies. For example, for four taxa labelled A, B, C and D, there are three possible (unrooted) bifurcating topologies:

  A-\    /-C         A-\    /-B         A-\    /-B
     |--|               |--|               |--|
  B-/    \-D         C-/    \-D         D-/    \-C

Given a tree with any number of tips (or leaves), each belonging to a particular taxon, the weighting of each taxon topology is defined as the fraction of all unique sub-trees, in which each taxon is represented by a single tip, that match that topology. Topology weighting therefore reduces the complexity of the full tree to a number of values, each giving the proportionate contribution of a particular taxon tree to the full tree.

This code implements the method Twisst (topology weighting by iterative sampling of sub-trees), which does what it says: it computes the weightings by iteratively sampling sub-trees from the full tree and checking their topology. This can be slow if there are many tips (e.g. 4 taxa with ten tips each gives 10 000 unique subtrees to consider. But there are some shortcuts to speed things up - see Weighting Method below.

Papers

Martin and Van Belleghem 2017 is where we present the Twisst method in full and test it on simulated and real data. Please cite this when using the software.

Van Belleghem et al. 2017 is where we first applied Twisst to identify narrow shared blocks in regulatory reagions.

Code

The main script, twisst.py implements the topology weighting.

It requires ete3 and numpy (tested on version 1.8).

A typical command looks like this:

python twisst.py -t input.trees.gz -w output.weights.csv.gz -g A 1,2,3,4,5 -g B 6,7,8,9,10 -g C 11,12,13,14,15 -g D 16,17,18,19,20

You can get a full list ot command options with python twisst.py -h.

NOTE Due to ongoing improvements, the script run_twisst_parallel.py is currently not working. However, the speed improvements to the core script make it usable even for quite large data sets.

Input

The main input is a tree file containing one or more trees in newick format. Most variants of newick are accepted - see the ETE documentation for details.

Multiple trees should be listed on separate lines.

All trees must contain all specified individual names as tip labels (see below.)

Output

There are two outputs:

The topologies file is specified with the -o flag. This file contains the possible taxon topologies in newick format.
The weights file is specified with the -w flag. This is a comma separated file, with one column for each topology, giving its weighting. The weightings are given in absolute counts (rather than proportions). This allows for estimation of confidence intervals downstream, if desired.

Specifying taxa and individuals

Taxa (groups) must be specified in the command line, using the -g flag. This flag must be present at least four times (with three groups there is only one possible unrooted topology).

The name of the group must be given after the flag, (optionally) followed by the names of the individuals (tip labels) that it contains, separated by commas (e.g: -g A 1,2,3,4,5).

Alternatively (or additionally), a tab-delimited file of tip labels and their corresponding groups can be provided with the --groupsFile flag. This should have tip labels in the first column and group names in the second column. The group names given in the groups file must match those in the command line.

For example, the example command above could alternatively be specified as

python twisst.py -t input.trees.gz -w output.weights.csv.gz -o topologies.trees -g A g B -g C -g D --method complete --groupsFile groups.tsv

Where groups.tsv is a text file containing the following:

1	A
2	A
3	A
4	A
5	A
6	B
7	B
8	B
9	B
10	B
11	C
12	C
13	C
14	C
15	C
16	D
17	D
18	D
19	D
20	D

Weighting Method

There are three options for the weighting method, specified with the --method flag.

The default method is complete. This will calculate the exact weightings by considering all subtrees. It performs a smart shortcut by collapsing monophyletic nodes, and can be quite fast if the taxa are well resolved and/or the tree is small. However, if the taxa are unresolved and the tree is large, this method may be too slow to be useful. If there are too many combinations to test (default > 1,000,000), the script will abort and revert to the fixed method described below. You can control this behaviour using the --abortCutoff option.

--method fixed will cause the script to estimates the weightings by randomly sampling a fixed number of subtrees. Sampling is done with replacement, so that the errors will fit a simple binomial distribution. The default number of iterations is 10,000, which gives very narrow error margins, but this can be modified using the --iterations flag.

The original version of twisst had an option to keep sampling until some level of confidence was reached. However, with speed improvements, sampling of thousands of combinations is achievable and the error margins become tiny and essentially irellevant. There is a supplementary figure in Martin and Van Belleghem 2017 demonstrating this.

Pipeline to generate the input trees file

There are various options for producing trees for windows across the genome. If you have whole genome sequence data, it is recommended to infer trees for narrow genomic intervals. 50 SNPs proved a useful window size in various simulations. If the window is too large, you may be averaging over regions of distinct ancestry, which can eliminate subtle quantitative variation in taxon relationships. However, if the interval is too small, you may have insufficient signal to infer a good tree.

My approach is to subset the alignment into windows and infer trees for each separately, using either maximum likelihgood or neighbour joining methods. PhyML can do both. RAxML has an option to do the sliding windows automatically. I don't recommend Saguaro as it tends to be biased towards the most abundant topologies.

My pipeline from BAM to trees

Starting with bam files, I genotype with GATK, using the HaplotypeCaller and GenotypeGVCFs tools. Here are example commands:

#HaplotypeCaller
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -nct 16 -R reference.fa -I input.bam -o output.g.vcf --emitRefConfidence GVCF --output_mode EMIT_ALL_CONFIDENT_SITES
#GenotypeGVCFs
java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -nt 16 -R reference.fa -V output.g.vcf --includeNonVariantSites -o output.vcf

I filter vcfs using Bcftools. This allows you to remove indels and invariant sites. I also like to convert uncertain genotypes to missing (./.) before phasing. Here is an example command that will convert all genotypes with < 5x depth of coverage and GQ < 30 to missing.

bcftools filter -e 'FORMAT/DP < 5 | FORMAT/GQ < 30' --set-GTs . input.vcf.gz -O u | bcftools view -U -i 'TYPE=="snp" & MAC >= 2' -O z > output.vcf.gz

For diploids, you can infer phase using Beagle 4 or SHAPEIT. Here is an example Beagle command:

java -Xmx12g -jar beagle.jar gt=input.vcf.gz out=output.vcf.gz impute=true nthreads=20 window=10000 overlap=1000 gprobs=false

My script to generate the trees takes a simple genotype format as input, which gives the scaffold, position and genotype for each sample. My code to generate this file from a vcf is in my repo genomics_general. Here is an example command:

python parseVCF.py -i input.vcf.gz --skipIndels --minQual 30 --gtf flag=DP min=5 | gzip > output.geno.gz

To get neighbour joining trees for snp windows, I have a script that runs Phyml for windows, using parallelisation, and outputs a single trees file. You can get the script from my genomics_general repo. Here is an example command:

python phyml_sliding_windows.py -T 10 -g input.phased.geno.gz --prefix output.phyml_bionj.w50 -w 50 --windType sites --model GTR --optimise n

NOTE: if you use phased diploid genotypes in phyml_sliding_windows.py, the output trees will include two tips for each sample, with suffixes "_A" and "_B". You will need to ensure that the groupd defined for Twisst match these names.

twisst's People

Contributors

Stargazers

Watchers

Forkers

transgirlcodes tuanaduong stevenvb12 biozhangzhou akp-ut javibio-git kaiserinlissi hj1994412 guzhongru hildeschneemann zyworship tholtzem giyany douglee wheatwill yuzhenpeng ningshuang-yao yancheer

twisst's Issues

Any statistical methods to infer "significant" signals of introgression via local regions ?

Would anyone be able to give insight if there is a statistical method for Twisst results to infer "significant" signals of introgression via local regions of the genome where the alternative topology is greater than the true topology?

thank you

.....Warning - failed to read tree.

Will “.....Warning - failed to read tree.” influence my analysis?

Plotting by chromosome

Hi! TWISST is great and I haven't had any problems so far, but I was wondering if you had a suggestion on how to plot the TWISST result by chromosome? Mostly how to plot it so that each chromosome would be on the same scale? If you have topology weight across the chromosomes seems like the plotting results are shown for all regions?

diploid species tree IDs

hi， I uesd unphased VCF file of a dipoild soybean to build tree file by 'phyml_sliding_windows.py'. but the result has a '_A' and '_B' suffix . is that correct ? if I want a result without the '_A' and '_B' suffix ,What should I do?

problem in getting trees by running raxml_sliding_windows.py

Hi Simon
I want to run TWISST for my dataset. I followed your tutorial to get the trees as input for TWISST.

I have a diploid genotype and my data is not suitable for phasing, then I used your script to produce the .geno file. And then, I used raxml_sliding_windows.py to get the window trees. It ran without any error and also gave me the .data.tsv file but the .trees.gz file was included only NA. I also ran this script for the example file (msms_4of10_l50k_r500_sweep.seqgen.SNP.geno.gz), but the same happens.

I used python2 and this command:

python raxml_sliding_windows.py -g msms_4of10_l50k_r500_sweep.seqgen.SNP.geno.gz --prefix out.50 --windType sites -w 50 --model HKY85

Do you have any idea why it happened?

I also have another question: Is it possible to use the ML trees I got by using sequences (including both variant and invariant sites) in IQTREE2 as input for TWISST?

With the best and thanks
Niloo

Different number of tips per windows

Hi @simonhmartin,

I just want to report here a small issue I have been running into with your very well thought program twisst.

By running your python code phyml_sliding_windows.py, I noticed that the program wasn’t reported the same amount of tip for each tree/window. It seams that 100% identical sequence were discarded by phyml.

I will suggest to use the phyml option --leave_duplicates in the phyml command line to avoid to get this problem. I don’t think this will affect the efficiency of the program

Best and thanks for sharing those nice and useful programs/tutorials
Josquin

Newick trees without predefined windows

My question is regarding the method that is used to construct the trees. For demography analysis, I use Relate https://myersgroup.github.io/relate/. Relate estimates genealogies based on the recombination rate, so obviously, each tree covers a different length on the genome (resulting in a very different number of SNPs per tree).
Is Twisst still a good method to use for counts and should some additional "weights" be put on top of the proportion of the topologies to account for the length of the region?

In your Rscript for plotting you provide a file window_data_file which could be produced for the Relate output (everything except the likelihood column could be provided). Could this potentially solve the issue of the different lengths?

Dealing with missing data

Hi Simon, thanks for the great code, your sliding window scripts and Twisst scripts have been invaluable to many of my analyses! I have a question though. No matter what I set my window size or my minimum number of SNPs to, there are always some trees that fail and produce NAs. I experimented with just using sed to remove those NAs in order to get Astral to reconcile my species tree (it doesn't handle NAs) and I have tried this same technique for your Twisst analyses. But I suspect that my problem is two fold. In some cases my SNP density is low and I have fewer than 50 SNPs in a window, even if I make the window very large. In these scenarios merely removing NAs seems to do the trick (though I'm not sure it's the most appropriate workaround). But the second problem that sometimes happens is that there are trees with missing taxa. This is a problem with a less obvious solution. My question to you is, if Twisst reports that a tree failed, because of one of the abovementioned reasons, will the analysis still produce a reliable output? If not, do you know a way of filtering data so that only data for which all individuals are represented is retained? VCFtools, GATK, and BCFtools filters all seem to filter on more of a per site basis than a per individual basis and it makes it tricky to work with missing data in downstream analyses. Lastly, if I run Twisst and it produces a weights file that has weights but has 'nan' for trees that failed, would it be appropriate for me to just remove the 'nan's from the file and then analyse it in R?

Issue plotting TWISST results

Hello,
I am currently trying to plot results from multiple TWISST runs (1 per chromosome), but I run into the following error:

> list.w <- list.files(pattern="_w") #vector of the weight files
> list.T <- list.files(pattern="_T") #vector of the topology files
> list.tsv <- list.files(pattern="data.tsv") #vector of the window files
> twisst.all <- import.twisst(weights_files = list.w, window_data_files = list.tsv, topos_file=list.T)
[1] "Reading weights and window data"
[1] "Number of regions: 31"
[1] "Computing summaries"
[1] "Cleaning data"
[1] "Getting topologies"
Error in file(file, "r") : invalid 'description' argument
In addition: Warning messages:
1: In is.na(apply(l$weights[[i]], 1, sum)) == F & l$window_data[[i]]$end -  :
  longer object length is not a multiple of shorter object length
2: In is.na(apply(l$weights[[i]], 1, sum)) == F & l$window_data[[i]]$end -  :
  longer object length is not a multiple of shorter object length
3: In is.na(apply(l$weights[[i]], 1, sum)) == F & l$window_data[[i]]$end -  :
  longer object length is not a multiple of shorter object length
4: In is.na(apply(l$weights[[i]], 1, sum)) == F & l$window_data[[i]]$end -  :
  longer object length is not a multiple of shorter object length
5: In is.na(apply(l$weights[[i]], 1, sum)) == F & l$window_data[[i]]$end -  :
  longer object length is not a multiple of shorter object length
6: In is.na(apply(l$weights[[i]], 1, sum)) == F & l$window_data[[i]]$end -  :
  longer object length is not a multiple of shorter object length
7: In is.na(apply(l$weights[[i]], 1, sum)) == F & l$window_data[[i]]$end -  :
  longer object length is not a multiple of shorter object length
8: In is.na(apply(l$weights[[i]], 1, sum)) == F & l$window_data[[i]]$end -  :
  longer object length is not a multiple of shorter object length
9: In if (file == "") file <- stdin() else { :
Error in file(file, "r") : invalid 'description' argument"

If I try to run the same thing without window files, I get a different error:

> twisst.all <- import.twisst(weights_files = list.w, topos_file=list.T)
[1] "Reading weights"
Error in names(l$window_data) <- names(l$weights_raw) <- paste0("region",  : 
  'names' attribute [31] must be the same length as the vector [1] "

I thought this error could be related to missing data (windows where phylogenetic inference failed) in some chromosomes, but trying to run these commands just on chromosomes without missing data gave the same results.
Also, it should be noted that I am able to plot the chromosomes individually without specifying the window file (chromosomes with missing data) or with specifying window files (chromosomes without missing data).

Is there any way to overcome this issue and plot all the chromosomes at once, preferably with the window files specified?

Thanks a lot in advance,
Loïs Rancilhac

error twisst: can not run twisst.py

Hi Simon,
I was following your instruction in general genomics pakage and running sliding wimdow tree inference with phyml. I manage to generate the trees.gz file as an mput for twisst.py. However when running twisst.py using the following command:
python twisst.py -t all_rename_filter2_geno0_phased_renammed_chr_phyml_ML.w50.trees.gz -w output_w50.weights.csv.gz -g ang -g Bta -g bux -g der -g eur -g imb -g ory -g scr -g spe -g str -g syl --method complete --groupsFile groups.tsv;
I got the following message:
Traceback (most recent call last):
File "twisst.py", line 576, in
topoDict = makeTopoDict(taxonNames, topos, args.outgroup if args.outgroup else None)
File "twisst.py", line 325, in makeTopoDict
output["topos"] = allTopos(taxonNames, []) if topos is None else topos
File "twisst.py", line 478, in allTopos
assert 4 <= len(branches) <= 8, "Please specify between 4 and 8 unique taxon names."
AssertionError: Please specify between 4 and 8 unique taxon names.

groups.tsv as:
ang_A ang
ang_B ang
Bta_A Bta
Bta_B Bta
bux_A bux
bux_B bux
der_A der
der_B der
eur_A eur
eur_B eur
imb_A imb
imb_B imb
ory_A ory
ory_B ory
scr_A scr
scr_B scr
spe_A spe
spe_B spe
str_A str
str_B str
syl_A syl
syl_B syl

Your advice would be much appreciated
Joro

Fatal error when writing weights file

The twisst run appears to be completing but then fails when it tries to write the outfile:

Traceback (most recent call last):
File "twisst.py", line 690, in
for x in range(nTopos): weightsFile.write("#topo" + str(x+1) + " " + topos[x].write(format = 9) + "\n")
File "/home/farman/anaconda_ete/lib/python3.6/gzip.py", line 260, in write
data = memoryview(data)
TypeError: memoryview: a bytes-like object is required, not 'str'

Analysis of topologies with low sample numbers

Hi Simon,

Thanks for the awesome twist!!
Can it be applied to three different cases - 1) low coverage data 2) unphased reference genomes or 3) low sample numbers per population (n=1)?

Since you mention it is dependent on allele frequencies, I wasn't sure if it's ok to naively assume frequencies of 0, 1, 0.5 for single samples.

Thanks,
Rohit

apparent error in x-axis of Twisst plots

Hi Simon, I have been playing around with your different functions in R. In particular, I have tried to make my figures somewhat less convoluted by subsetting my twisst data by scaffold and also by choosing only the top 2 or top 4 topologies. After playing around with these different functions, I noticed something unusual. When I subset by scaffold, the number of positions on the x-axis does not change. I was under the impression that the positions labeled on the x-axis of the twist plots are positions that span the genome and if you subset by scaffold then these would be positions that span the scaffold you have chosen. But regardless of how I subset my data, there are still 500 million positions on the x-axis. Do you have any idea why this might be? I have attached my weights file below if that helps

barbatulum_biallelic_weights.csv

Importance/impact of using phased data

Hi Simon,

The publication for twisst recommends that diploid reseq data should be phased if possible. I'm just wondering if is valid to use unphased data with the software, and if it is viable, what impact unphased data will have on the results from twisst.

Thanks for the help

Alastair

unphased data

I wonder whether I can use unphased data because my data is polyploid

Versioned release package for Twisst

Hi,

I'm working to make Twisst available on Bioconda.

Please refer. -> https://docs.github.com/en/repositories/releasing-projects-on-github/managing-releases-in-a-repository#creating-a-release

I need your help creating a versioned release to use for the Bioconda recipe. Once Twisst is added to Bioconda, it'll also be made available as a Docker container from Biocontainers, and as a Singularity image from the Galaxy Project. The Bioconda bot will also recognize future releases and automatically update the recipe.

Please let me know

Thanks
Jay

How to deal with figure margins too large

plot.twisst(twisst_data_smooth, mode=2) #mode 2 overlays polygons, mode 3 would stack them
Error in plot.new() : figure margins too large
Calls: plot.twisst -> plot.phylo -> plot.default -> plot.new
Execution halted

About the numbers in the weights file

Hi, When I used Twisst to compute topology weights, I got a result file like this.
#topo1 (outgroup,((AQ,LO),GY));
#topo2 (outgroup,((AQ,GY),LO));
#topo3 (outgroup,(AQ,(LO,GY)));
topo1 topo2 topo3
topo1 topo2 topo3
0 1 0
1 0 0
1 0 0
1 0 0
0 1 0
0 1 0
0 1 0
0 1 0
1 0 0
0 1 0

The weights file only contains number 0 or 1 for each topo.
Is this an abnormal situation?
Why I didn't get the absolute counts for each topo?

Issue with generating trees in PhyML

Hello,

I'm working on conducting a Twisst analysis, but I'm running into issues creating sliding window trees in PhyML. I've been trying to run the phyml_sliding_windows.py script as follows (after running parseVCF.py):

python phyml_sliding_windows.py -g Mac2Renumberfortwisst.geno.gz --prefix output.phyml_bioml.w50 -w 50 --windType sites --model GTR --phyml /programs/phyml/bin/phyml

This outputs .trees and .data files, but the .trees file is much smaller than I think it should be (<100 B). When I feed that file into twisst.py, it seems that all the trees result in a "Warning - failed to read tree" message. Do you have any idea what could be causing the issue with PhyML or if there's an obvious issue in my command?

Thanks for your help.

Best,
Declan

error while using phyml_sliding_windows.py "No module named 'genomics'

I was able to use all the other tools to come from bams to trees. although when I reach the step to use phyml, it always fails with a missing module:

phyml_sliding_windows.py", line 19, in <module>
    import genomics
ModuleNotFoundError: No module named 'genomics'

I have phyml installed, I tried with the environment where the ETE toolkit is installed as well and I always go back to this missing module.

Any help to solve the issue?

Thanks

P.S. I tried the same with Raxml and I had the same problem (I have raxml installed as well)

Problem with import.twisst()

Hi, Simon
Thanks for your code!
I tried to plot with plot.twisst without the topology file, while I got Error massage like this:

source("../plot_twisst.R")
data.table 1.14.2 using 2 threads (see ?getDTthreads). Latest news: r-datatable.com
weights_file <-'msms_4of10_l1Mb_r10k_sweep.seq_gen.SNP.w50sites.phyml_bionj.weights.tsv.gz'
window_data_file <- 'msms_4of10_l1Mb_r10k_sweep.seq_gen.SNP.w50sites.phyml_bionj.data.tsv.gz'
twisst_data <- import.twisst(weights_file, window_data_file)
[1] "Reading weights and window data"
[1] "Number of regions: 1"
[1] "Computing summaries"
[1] "Cleaning data"
[1] "Getting topologies"
Error in names(l$topos) <- sapply(names(l$topos), substring, 2) :
NULL是不能有属性的
此外: Warning message:
In read.tree(text = topos_text) : empty character string.

My R version is 3.6.2 (2019-12-12)

Error in l$window_data[[i]]$mid :

When I run the codes in a R studio .
it was wrong with the errors as follows
Error in l$window_data[[i]]$mid :
$ operator is invalid for atomic vectors
In addition: Warning message:
In split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) :
data length is not a multiple of split variable

Recommendations for reducing tip number

Hi Simon,

I'm interested in using TWISST to test a few hypotheses about support for different lineage relationships. My samples can be assigned to 12 lineages (including an outgroup) but not all lineages are relevant for all hypotheses. I have more than 4 samples per lineage, except for just one sample in the outgroup.

To reduce the number of tips, is it better to [a] remove taxa that are not relevant to the hypothesis being considered from the dataset or [b] lump lineages that descend from a common ancestor but are not directly relevant to the hypothesis under considering into one tip? I've summarized this in the image below, hopefully this helps clarify my question.
I would very much appreciate your thoughts on this,
Thanks, Inbar

Some strange results of topo weightings

Hi Simon,
Thank you for your great software and pipeline. It's really useful for my analysis.

I guess the sum of all topo weightings in one tree equals the product of the number of samples in each taxon (or N in your paper). But some trees in my data don't equal to the N while some trees do. I followed your pipeline and there was no error or warn when the scripts ran.

Here is the subset of my input file. The topo weightings of first tree seems OK because the sum of them equals N (66 * 62 * 134 * 98 = 53736144). But the second seems not good.

Do you have any idea why it happened?

My code is: python twisst.py -t subset.phyml_bionj.w100.trees.txt -w subset.output.weights.txt -g PopI -g PopVI -g PopII -g PopV --method complete --groupsFile groups.txt

Best wishes,
Junyi

subset.output.weights.txt
subset.phyml_bionj.w100.trees.txt
groups.txt

'Ambiguous node name

Hi Simon,

when I was doing weights inference, I was using diploids so the IDs in the tree I converted would have _A and _B suffixes, in order to ensure that the sample IDs were consistent, I removed the suffixes from the tree, and subsequent analyses would report the following error: ete3.coretype.tree.TreeError: 'Ambiguous node name: G3_PDS'; Similarly, I got a similar error after the same analysis with the heliconius92.chr18.500001-1500000.phased.vcf.gz file under twisst/examples: ete3.coretype.tree.TreeError: 'Ambiguous TreeError: 'Ambiguous node name: ros.CAM2059'.
I would very much appreciate if you could provide me with some advice to solve this problem,
Thanks, Yang

AssertionError: Please specify

Hello,

Thank you for your program!
I encountered an error when I ran TWISST in a 10-taxon data set.

assert 4 <= len(branches) <= 8, "Please specify between 4 and 8 unique taxon names."
AssertionError: Please specify between 4 and 8 unique taxon names.

However, I have assigned all taxon into groups.
-g R Lasent -g Cep Liginc -g Hex Pedetontus_sp_ -g Cope Calsin -g B Trilon -g Out Rsan -g Oli Parsp_ -g The Ppol -g Per Porpru -g De Lvan
Could you please recommend how to solve this problem ? I have attached the tree file for your reference.
Thank you very much for your kind help !
405-set2-10t-all-root.txt

Best,
Alex

Syntax Error line 663

I cannot run twisst, either on my linux server or locally on my mac, due to a syntax error:

  File "twisst.py", line 663
    print(".", end="", file=sys.stderr, flush=True)
                  ^
SyntaxError: invalid syntax

Any ideas on what's going wrong?

Cheers,
Mark

Error getting twisst.py to run

I encountered the same error. Were you able to identify the dependency that was missing?

File "./twisst/twisst.py", line 666
    print(".", end="", file=sys.stderr, flush=True)
                  ^
SyntaxError: invalid syntax

Installed versions:
ete3 v3.1.2
numpy v1.15.4
python v2.7.15

Thanks!

Originally posted by @skitchen19 in #14 (comment)

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

Dear Simon,

I'm using your software twisst. But I got a problem which I can't handle. I hope you can help me. Thanks a lot.

Here is my scripts and erro information. And you can find the input files in attachment.

command line:

/gpfs/home/gaowei/anaconda2/envs/anaconda3/bin/python3 /gpfs/home/gaowei/00.biosoft/twisst-master/twisst.py -t test.tree -w res.test.output.weights.csv.gz --outputTopos res.test.topologies.trees -g NJ -g GHP -g GHV -g AKS --groupsFile group.AKS.txt --outgroup NJ -D res.test.dists.txt

erro infor:

Traceback (most recent call last):
File "/gpfs/home/gaowei/00.biosoft/twisst-master/twisst.py", line 626, in
simplify=not args.skip_simplify, abortCutoff=args.abortCutoff, verbose=args.verbose)
File "/gpfs/home/gaowei/00.biosoft/twisst-master/twisst.py", line 404, in weightTree
currentDists[taxPair[0],taxPair[1]] = currentDists[taxPair[1],taxPair[0]] = sum(leafLeafChains[comboPair[0]][comboPair[1]].dists)
IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices

Problem with twisst output

Hi Simon,

I'm a Postdoc working in Katie Wagner's lab and II have been running some preliminary twisst analyses on some trees (from sliding window raxml analyses). Everything runs fine, but when I look at the 'weights' output file I do not see weights or counts for each topology, only the topologies themselves followed by a series of rows that begin by listing the topology names and then shifts to nan.

Here is an excerpt:

#topo1 (kivu9	(((((sweya	kivu3)	kivu5)	kivu6)	kivu7)	kivu8));
#topo2 (kivu9	(((((sweya	kivu3)	kivu5)	kivu6)	kivu8)	kivu7));
#topo3 (kivu9	((((sweya	kivu3)	kivu5)	kivu6)	(kivu7	kivu8)));
#topo4 (kivu9	(((((sweya	kivu3)	kivu5)	kivu7)	kivu6)	kivu8));
#topo5 (kivu9	(((((sweya	kivu3)	kivu5)	kivu7)	kivu8)	kivu6));
#topo6 (kivu9	((((sweya	kivu3)	kivu5)	kivu7)	(kivu6	kivu8)));
#topo7 (kivu9	(((((sweya	kivu3)	kivu5)	kivu8)	kivu6)	kivu7));
#topo8 (kivu9	(((((sweya	kivu3)	kivu5)	kivu8)	kivu7)	kivu6));
#topo9 (kivu9	((((sweya	kivu3)	kivu5)	kivu8)	(kivu6	kivu7)));
.
.
.
.
topo1 topo2 topo3 topo4 topo5 topo6 ...
nan nan nan nan nan ...

I was wondering if you had any advice or thoughts about what could be going wrong. I have 74 individual samples broken into 8 taxa. I'll keep playing around with this, but I thought I would write to ask if you had any thoughts.

Thanks for the very useful scripts and thanks for your time.

Best

Chad

Per taxon sample size

Hello,

I notice that the simulations and various empirical examples you have run for TWISST all use multiple samples per designated taxon.

My question is simple: Is there reason to think TWISST would be limited, or at least still useful, when there is only one diploid individual per designated taxon. I have the diploid genomes of four species sequenced, but have only those genomes for variant calling (i.e. no resequenced individuals to increase sample size within species).

Thank you!

problem with plot.twisst

I am plotting small regions of a 73 samples dataset, but when I try to plot with plot.twisst including show_topos=TRUE it shows me the following error:
Error in layout(rbind(topos_layout_matrix, weights_layout_matrix), height = c(rep(1, :
too many columns in layout, limit 200

I've also tried to plot with small and large values of width and height

Topology affected by order of group definition arguments

Dear Simon,
I have encountered a strange behaviour of your program. Identical set of input trees is giving me different twisst results, and it has something to do with order of group definition arguments.
For example while using following command:
python $twisst --threads $nt -t ${base}.trees.gz -w ${base}.weights.csv.gz --outputTopos ${base}.topologies.trees $gr --method complete
where
gr="-g I I_364_A,I_364_B,I_370_A,I_370_B,I_380_A,I_380_B -g D D_001_A,D_001_B,D_002_A,D_002_B,D_013_A,D_013_B -g W W_001_A,W_001_B,W_009_A,W_009_B,W_025_A,W_025_B -g M M_182_A,M_182_B,M_192_A,M_192_B,M_208_A,M_208_B -g P P_001_A,P_001_B,P_016_A,P_016_B,P_017_A,P_017_B -g S S_001_A,S_001_B,S_005_A,S_005_B,S_140_A,S_140_B -g C C_254_A,C_254_B,C_256_A,C_256_B,C_270_A,C_270_B -g F F_272_A,F_272_B,F_276_A,F_276_B,F_280_A,F_280_B "
These are the best topos:
#topo1 (F,((((((I,D),W),M),P),S),C));
#topo2 (F,((((((I,D),W),M),P),C),S));
#topo3 (F,(((((I,D),W),M),P),(S,C)));
#topo4 (F,((((((I,D),W),M),S),P),C));

And if I change gr:
#gr="-g C C_254_A,C_254_B,C_256_A,C_256_B,C_270_A,C_270_B -g F F_272_A,F_272_B,F_276_A,F_276_B,F_280_A,F_280_B -g I I_364_A,I_364_B,I_370_A,I_370_B,I_380_A,I_380_B -g M M_182_A,M_182_B,M_192_A,M_192_B,M_208_A,M_208_B -g P P_001_A,P_001_B,P_016_A,P_016_B,P_017_A,P_017_B -g S S_001_A,S_001_B,S_005_A,S_005_B,S_140_A,S_140_B -g W W_001_A,W_001_B,W_009_A,W_009_B,W_025_A,W_025_B"
then
#topo1 (W,(((((C,F),I),M),P),S));
#topo2 (W,(((((C,F),I),M),S),P));
#topo3 (W,((((C,F),I),M),(P,S)));
#topo4 (W,(((((C,F),I),P),M),S));

The last taxon in the line of -g arguments is always taken as an outgroup and the topology mirrors the order as well. The topologies from twisst do not make much sense biologically while input trees seems to be OK. Am I doing something wrong, or is it an expected behaviour, or is it a bug?
Thanks in advance,

Jakub

PS: in the second example I missed one taxon, but even if I do not miss it the strange behaviour remains.

ValueError: 'd223af7fbf34f5a5111bb1542d8b925c' is not in list

Hi @simonhmartin,
Thank you for a great program!
I have run twisst many times without an error, but now seem to be receiving error messages on a few trees. Initially I used BioNJ trees with the same individual sample labels and had no issues. These BioNJ trees did not have bootstrap values. I have switched over to using max likelihood trees generated with RAxML and FastTree. FastTree produces aLRT support values, could this be causing the below errors?
thanks,
scott

Traceback (most recent call last):
File "/afs/crc.nd.edu/user/s/ssmall2/anaconda2/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
self.run()
File "/afs/crc.nd.edu/user/s/ssmall2/anaconda2/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/twisst/run_twisst_parallel.py", line 44, in weightTree_wrapper
weightsData = twisst.weightTreeSimp(tree=tree, taxa=taxa, taxonNames=taxonNames, topos=topos, getDists=getDists, abortCutoff=args.abortCutoff)
File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/twisst/twisst.py", line 383, in weightTreeSimp
x = topoIDs.index(prunedID)
ValueError: 'dd1ccb7098f363bc0cab5e347c4987a7' is not in list
2234 trees queued | 2182 results received | 1664 results written.
4368 trees queued | 4317 results received | 1664 results written.
Process Process-4:
Traceback (most recent call last):
File "/afs/crc.nd.edu/user/s/ssmall2/anaconda2/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
self.run()
File "/afs/crc.nd.edu/user/s/ssmall2/anaconda2/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/twisst/run_twisst_parallel.py", line 44, in weightTree_wrapper
weightsData = twisst.weightTreeSimp(tree=tree, taxa=taxa, taxonNames=taxonNames, topos=topos, getDists=getDists, abortCutoff=args.abortCutoff)
File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/twisst/twisst.py", line 383, in weightTreeSimp
x = topoIDs.index(prunedID)
ValueError: '5b339154d31bc114b9db4f34f955306e' is not in list
6528 trees queued | 6476 results received | 1664 results written.
Process Process-1:
Traceback (most recent call last):
File "/afs/crc.nd.edu/user/s/ssmall2/anaconda2/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
self.run()
File "/afs/crc.nd.edu/user/s/ssmall2/anaconda2/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/twisst/run_twisst_parallel.py", line 44, in weightTree_wrapper
weightsData = twisst.weightTreeSimp(tree=tree, taxa=taxa, taxonNames=taxonNames, topos=topos, getDists=getDists, abortCutoff=args.abortCutoff)
File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/twisst/twisst.py", line 383, in weightTreeSimp
x = topoIDs.index(prunedID)
ValueError: 'd223af7fbf34f5a5111bb1542d8b925c' is not in list

suggestion for plot.twisst in R with > 5 sp

In R, plot.twisst works great with 5 or less species, but when I add an outgroup the image becomes unreadable due to the number of possible topologies. Would it be possible to set plot.twisst to only use a set number of the most heavily weighted topologies for visualizing? When looking at the boxplots of weights there is consistently <10 topologies that are making any sort of notable contribution.

My solution at the moment is clunky. I was able to subset the weights file by top weighted columns in R.

data <- read.table(original_weights_file, header=TRUE)
best_topos <- colnames(data)[sort.list(colSums(data[,-1]), decreasing=TRUE)[1:10] + 1]
new_weights <- data[best_topos]
write.table(new_weights, "top10.weights.tsv", quote = FALSE)

Then manually add back in the top 10 topologies and read this new file in as the weights file. From this method, I can get a readable boxplot with topologies (albeit the numbers are now slightly skewed because I'm only using 10 topologies, not 105). But, this way is cumbersome and looks to cause problems with plot.twisst (topologies were mismatched). Is there another solution?

Thanks,
Rebecca

Viewing Rooted Topologies in R Code

Hi Simon,

I am currently using Twisst (which seems like a fantastic program!) to look at the weighted topologies of trees across chromosomes in my study group. Including an outgroup, there are ~6-8 groups (depends on how you view it, but there is definitely hybridization) and 3 ingroup species. For simplicity and for the sake of just getting the code to work, I am running Twisst with 4 groups: Outgroup, C, V, O. I successfully ran Twisst on one chromosome, which included a set of 961 bootstrapped gene trees from RAxML-NG.

Upon running this in the R code for visualization, I noticed that the output trees are unrooted. I know this is expected output from Twisst and in the publication it mentions that your trees from the weights.tsv.gz file are unrooted, but rooted for the sake of visualization in the figure (attached below is our data with the C, V, O, and Outgroup showing unrooted topologies).

I noticed in the downloadable weights.tsv.gz from Github, though, you seem to have rooted trees for looking at the weights. How did you get rooted trees from your analysis? We are pretty certain we know the 'true' topology of these organisms, and it would be helpful to know if there is a way to visualize at least the outgroup as an outgroup in our graphs.

I also am wondering, is the interpretation the same? If you look at the most weighted topology in the image I attached of our own data run, despite it showing an unrooted tree, can I still interpret this as the most abundant topology being (Outgroup,(S,(V,O)))?

Also, I have one other question, which is a bit more basic on how Twisst works. In the image below, it shows a distribution of weights for the three topologies when I have gene trees with 4 groups. How does Twisst calculate multiple numbers for a single position (which would be a gene tree). I ran Twisst on a single gene tree with 4 groups expecting to get a weight distribution of 1,0,0 since there is only one tree topology for a given gene tree (instead I saw numbers that showed multiple topologies for a single gene tree) . But since Twisst works on subtrees, I think I am misinterpreting how the algorithm works and how to interpret the results. How do we get 2-3 numbers as weights for each gene tree provided?

If it helps, here is the code we ran:

python twisst.py -t iqtree.genetrees.tre.gz -w output.weights.csv.gz --outputTopos \ output.topologies.trees \
	 --method complete \
	-g S \
	-g V \
	-g O \
	-g Outgroup \
	--groupsFile group-file.txt

Here is the groupFile we made:

266_og	Outgroup
261_og	Outgroup
263_r	S
262_r	S
267_r	S
269_r	S
263_r	S
267_r	S
268_r	S
266_r	S
268_r	V
268_o	V
268_o	V
261_o	V
265_o	V
266_o	V
266_w	O
265_w	O
263_w	O
265_w	O
261_w	O
260_w	O
267_w	O
263_w	O
261_w	O
269_w	O
266_w	O
266_w	O
267_w	O
264_w	O
263_w	O
264_w	O
262_w	O
266_w	O

Thank you so much! This program is looking really useful for this project, and I am looking forward to understanding it better!

How to set groups

Traceback (most recent call last):
File "/home/mole07/workspace/chongcexu/twisst/twisst.py", line 546, in
assert len(args.group) >= 4, "Please specify at least four groups."
AssertionError: Please specify at least four groups.

This is what I set up groups.tsv.
HS5 A
HS5 A
HS5 A
HS5 A
XH1 B
XH1 B
XH1 B
XH1 B
JT10 C
JT10 C
JT10 C
JT10 C
DH1 D
DH1 D
DH1 D
DH1 D
I don't know what went wrong. Ask for help.

groups.tsv

This is what I set up groups.tsv.
HS5 A
HS5 A
HS5 A
HS5 A
XH1 B
XH1 B
XH1 B
XH1 B
JT10 C
JT10 C
JT10 C
JT10 C
DH1 D
DH1 D
DH1 D
DH1 D
I don't know what went wrong. Ask for help.

is the 'lnL' in data.tsv.gz the likelihood ratio when building tree ?

Hi simonhmartin @simonhmartin ,
here is a small question , is the lnL in data.tsv.gz the likelihood ratio when building tree with raxml or phyml?
Thanks

erro twisst_data_smooth

twisst_data_smooth <- smooth.twisst(twisst_data, span_bp = 1000000, spacing = 50000)
Error in seq.default(twisst_object$pos[[i]][1], tail(twisst_object$pos[[i]], :
wrong sign in 'by' argument
Calls: smooth.twisst -> seq -> seq.default

AssertionError: Named samples not present in tree.

Hello,
I am trying to run the Twisst with my set of gene trees (3963), created with IQTree (newick format). I have created these trees with 84 individuals belonging to 8 different groups. And for some of the genes, the individuals with high missing data (NNs) were excluded before creating the trees. That's why not every gene tree includes all of the individuals. Additionally there is no consensus that can specify just one or two specific individuals were excluded in the trees, this is changing among the trees.
I know in the tutorial it is written that "All trees must contain all specified individual names as tip labels”
However, i have run the twisst anyway and i am getting the weighting and topology outputs (weighting file without values, just topologies). And, in the end of the report file it says ;

Traceback (most recent call last):
File "/home/hpc/pr62ba/di49pey3/apps/phylogenomics/twisst/bin/twisst.py", line 735, in
assert namesSet.issubset(leafNamesSet), "Named samples not present in tree."
AssertionError: Named samples not present in tree.

So, in this case, is there a way that can fix this, or twisst can not used for my data ?

The code i have run is like following ;

twisst.py -t [All_gene_trees] -w weightings_output --outputTopos topologies_output -g [species_1] -g [species_2] -g [species_3] -g [species_4] -g [species_5] -g [species_6] -g [species_7] -g [species_8] --outgroup [outgroup_species_name] --groupsFile [grouping] --method complete

Best,
Burçin

plotting specific regions within chromosomes

Hi Simon, do you have a suggestion how I can specify in twisst.plot plotting the weights for specific regions within a scaffold/chromosome? For example, If I want to plot the region within 1000-50,000 bp of chromosome/scaffold 3, how could I specify it? I tried subset.twisst.by.regions but only succeeded subsetting by chromosome/scaffold. Thank you!

Weight calculation fails mid-analysis

Hello,

I've been using twisst for a few projects recently, but keep running into this error on my most recent set of analyses after switching to phyml for constructing trees:

.Traceback (most recent call last): File "/home/tkess/twisst/twisst.py", line 637, in <module> simplify=not args.skip_simplify, abortCutoff=args.abortCutoff, verbose=args.verbose, taxonNames=taxonNames) File "/home/tkess/twisst/twisst.py", line 359, in weightTree rootLeafChains = makeRootLeafChainDict(tree, collapseDict=taxonDict if simplify else None, preserveDists=getDists, treeFormat=treeFormat) File "/home/tkess/twisst/twisst.py", line 300, in makeRootLeafChainDict chains = getChainsToLeaves(tree, collapseDict=collapseDict, preserveDists=preserveDists) File "/home/tkess/twisst/twisst.py", line 158, in getChainsToLeaves childrenChains = [getChainsToLeaves(child, collapseDict, preserveDists) for child in children] File "/home/tkess/twisst/twisst.py", line 158, in <listcomp> File "/home/tkess/twisst/twisst.py", line 154, in getChainsToLeaves chain = NodeChain([node], dists = [] if preserveDists else None) File "/home/tkess/twisst/twisst.py", line 62, in __init__ super(NodeChain, self).__init__(nodeList) RecursionError: maximum recursion depth exceeded while calling a Python object

I get this error after getting weights for ~1000 windows, so something is going wrong during the weights calculation mid-analysis. Has anyone seen something like this before?

Tony

How to interpret the results(output.weights.csv)

Hi Simon:
After i run : python twisst.py -t B16_sonic_953pep_iqtree_root.nwk -w output.weights.csv --outputTopos topologies.trees --method complete -g D 1,2,3 -g C 4,5 -g B 6,7,8 -g A 9,10,11.
i got the result file :output.weights.csv

But i do not understand thoes numbers below each "topo".
Thanks!

Tutorial question

In the tutorial starting from the bam files, the GenotypeGVCF call includes invariant sites, and then they are immediately filtered out by the bcftools step. Is there a reason for this?

simonhmartin / twisst Goto Github PK

twisst's Introduction

Twisst: Topology weightying by iterative sampling of sub-trees

Papers

Code

Input

Output

Specifying taxa and individuals

Weighting Method

Pipeline to generate the input trees file

My pipeline from BAM to trees

twisst's People

Contributors

Stargazers

Watchers

Forkers

twisst's Issues

Recommend Projects

Recommend Topics

Recommend Org