matsengrp / ecgtheow Goto Github PK

View Code? Open in Web Editor NEW

2.0 2.0 2.0 178 KB

Ancestral lineage reconstruction using BEAST or RevBayes

Python 100.00%

ecgtheow's People

Contributors

Stargazers

Watchers

Forkers

psathyrella standardgalactic

ecgtheow's Issues

For how long do we have to run the chains?

What sort of convergence should we be seeing in terms of ESS of various parameters in order to think that the ancestral lineage graph is stable (i.e. it won't change with longer MCMC runs)?

Poset plotting layout needs help

When we run the poset plotting it looks like this:

Can we do some randomization so that the labels don't overlap? It seems that we could perhaps take the default layout, save it, then randomize it a little using layout_extend_randomly? I couldn't immediately find a way to save the layout, but I'm sure it's easy.

We need blobs!

Since our last meeting, I've been pondering what to do with highly uncertain regions of the graph, such as this beast copied from @lauranoges ' thoughts on #12 :

In our last conversation I realized that this could happen when even in the absence of tree topology uncertainty-- it can just be from having long branches and a given branch not knowing where to attach. In this case we can have substantial uncertainty high in the graph, even though it contracts back down to more certain nodes lower in the graph. Even though this isn't a problem in principle, in practice it makes our graphs impossible to look at and interpret.

Recall that each node in the graph represents exactly one sequence. We might have more interpretable results by generalizing such that nodes can represent a collection of similar sequences. I propose that we call these "blobs", and use some other shape to represent their contracted forms in our graph.

How do we find an appropriate blobification? My sense is that we want to infer a blob when flow through the directed graph first separates and then comes back together under Markov flow. This kind of reminds me of MCL but it's simpler because we mostly have a DAG here. That algorithm doesn't work with directed graphs.

We can think about that more, but assuming we can do that does this seem like a good idea?

try and use stochastic mapping from BEAST

reminder to @dunleavy005 to check this out

Rogue taxon identification

In Bayesian phylogenetics, "rogue taxa" are taxa that move around the tree. I definitely think that this is what we were seeing with the long branches in the 125 set. See, e.g. https://scholar.google.com/scholar?q=%22rogue+taxa%22&btnG=&hl=en&as_sdt=0%2C48

How shall we proceed? We could try some existing solution, or directly look for this effect in terms of things moving on and off the lineage. If there's too much uncertainty and they have long branches, perhaps they get pruned?

Rewrite partis input code

Background: @metasoarous and Duncan have worked really hard to make the CFT data flow better. Something about yaml.

Now, however, we'll need to get @metasoarous 's cooperation in re-writing the ecgtheow file input since we used to use the pandis creation (by @matsen ) that took partis output and transformed it into a healthy fasta file of input sequences for BEAST.

@metasoarous , we'll need the pieces of your code that turn the seed clusters within the best partitions (and best-plus-X partitions, if applicable) into healthy, seed-lineage-pruned fasta files.

When can I sit down with @metasoarous and @dunleavy005 to make this happen?

compare inferences to previous ones

With 542edad the pull-ignored.sh script should pull in some older inferences from @lauranoges and company. We would like to compare our existing inferences using ecgtheow with these previous inferences. Laura can give details about these files.

My proposal is to make a FASTA diffing tool. It should be simple.

Play around with Biopython's FASTA loading, loading in some FASTAs. Note that the sequences won't be sequences, but will be objects. If needed you can get raw sequence strings back from a seq_obj by using str(seq_obj).
Turn these into dictionaries that are keyed on the sequence and take values as ids. Not tested code, but this will look like seqs = {str(r.seq): r.id for r in SeqIO.parse(args.fasta_path, "fasta"))}
Load in two of these, then look for intersection and symmetric difference. The slickest way to do this is to use set operations after turning the keys into sets.
The desired output is something like the following:

in both:
naive0 <-> naive0
one_sequence_name <-> another_sequence_name

in filename1:
a_seq
another_seq

in filename2:
b_seq

if you want to get fancy you could come up with some serialization format.

Another fancy option would be to integrate this into the ecgtheow graph display, in which a specified set of sequences (with the same naive and seed sequences) were added to the graph as needed, and highlighted with special symbology (such as being boxes rather than circles. But that seems like a reasonable second step given the results of the first.

health metrics agreement

Ahh, I'm freaking out a little this morning because I'm worried that the "health metrics" (healthy_to_fasta.py) we apply in the ecgtheow run_beast_asr.sh script are not the same as the health metrics we apply in the CFT pipeline.

Stop codons
Last word in CFT is that I maintained that 1 STOP codon could be allowed, but in the ecgtheow project, I determined that I wanted no STOP codons allowed. I forgot that ecgtheow was operating independently. @metasoarous and @dunleavy005 we need to make sure we're on the same page here -- and maybe using/updating the same scripts?

Indel reversal
I can't tell for sure, but I think we're still using indel reversal in ecgtheow healthy_to_fasta.py? I'm not sure what the latest decision was regarding indels @matsen ? I remember they messed up some trees but that we weren't sure if we wanted to ignore them quite yet. I was in favor of ignoring them completely at one point, but that was more of a "want" not a logical/conservative decision.

Try out ancestral sequence reconstruction with RevBayes

I'm becoming less enthused about BEAST's clock assumptions, and suspect that we may get more stable results by using unrooted trees.

The ever helpful and friendly Andy Magee has supplied us with an ASR example for RevBayes.

This example is for a fixed topology. Apparently, to make it change topologies, add these two lines:

moves[++mi] = mvNNI(topology, weight=10.0)
moves[++mi] = mvSPR(topology, weight=5.0)

before this line: my_model = model(Q)

And remove this line: topology.clamp(true_tree)

validate!

Of course, the next step after checking out the previous inferences is to validate on simulations.

This would be using @krdav 's framework.

I hope to see some plot relating inferred posterior probability of a sequence to how likely it is to be correct (i.e. in the simulated tree).

Try data set variants

There are several processing steps that go into getting a data set ready for BEAST. These are automated in CFT, but given that we want to try variants we have to do them by hand.

Install dependencies

conda install pandas
conda install -c bioconda colorbrewer
pip install seqmagick

git clone [email protected]:matsen/pandis.git
git clone [email protected]:matsengrp/cft.git

(Below I assume that you have cloned these in a ~/re/ directory, but change below depending on your layout.)

Getting a clonal family

 cp /fh/fast/matsen_e/processed-data/partis/laura-mb/v9/seeds/BF520.1-igh/BF520-h-IgG/run-viterbi-best-plus-0.csv BF520.1-h-IgH.csv

note that light chain is next door.

Extracting clonal family sequences

Note that we should vary the definition of "healthy" to see if we can get more intermediate sequences on the naive-to-seed path.

~/re/pandis/transpose_family.py BF520.1-h-IgH.csv
~/re/pandis/healthy_to_fasta.py BF520.1-h-IgH.family_0.csv
~/re/pandis/get_naives.py BF520.1-h-IgH.csv >> BF520.1-h-IgH.family_0.healthy.fasta

Pruning the tree

FastTree -nt BF520.1-h-IgH.family_0.healthy.fasta > BF520.1-h-IgH.family_0.healthy.tre
~/re/cft/bin/prune.py --naive naive0 --seed BF520.1-igh BF520.1-h-IgH.family_0.healthy.tre -n 100 --strategy seed_lineage > BF520.1-h-IgH.family_0.healthy.seedpruned.100.ids
seqmagick convert --include-from-file BF520.1-h-IgH.family_0.healthy.seedpruned.100.ids BF520.1-h-IgH.family_0.healthy.fasta BF520.1-h-IgH.family_0.healthy.seedpruned.100.fasta

This will give you a fasta you can use to include into the BEAST xml file, or build trees in Geneious!

naive0 missing from graphic output

From slack:
@dunleavy005 I generated a tree (graphic) that had no naive sequence at the top, so I’m suspicious that the prune.py script is not working properly to keep the naive0 sequence. @csmall just mentioned this as a potential problem in our CFT work, so maybe it’s a problem here? My question for you is: does the prune.py script get updated regularly and/or have there been any changes?

this was from laura-mb dataset

I think the problem we’re experiencing with the current output is that not all the lineage pathways are beginning from naive0. I reduced the filter to 50 instead of 100 and got this graphic:

@dunleavy005 where do we specify that all the lineages of interest should start from --naive and progress to our --seed?

Be able to run different data sets, heh

Right now we have a single BEAST .xml file, at

/home/matsengrp/working/matsen/ecgtheow/runs/2017-07-10/BF520.1-h-IgH.family_0.healthy.tre.seedpruned.100.ids.xml

This clearly needs to be made more flexible.
Although we could struggle along with hand-coded xml generation, the right thing is to use a templating language. I very much suggest Jinja2.

Chris Warth has set up a much more sophisticated example than what we have.

I'd take a look at his stuff for inspiration, but build up from minimal examples rather than try to reduce from what he has there.

Please commit templates and code, but again not data.

Improving display of graph

Here's how things look on a 10,000 sample run, cutting out 1000 burn-in samples:

It seems to me that the next step, and perhaps a good place for @dunleavy005 to step in, is to do shading in proportion to confidence. Note that GraphViz supports RGB colors. I'd suggest just using Red-Green-Blue-Alpha where alpha varies according to confidence. If we wanted to drag in extra dependencies, we could use Matplotlib's color schemes, but that seems overkill here.

It would be nice to have this shading for both the edges and the nodes. For the nodes we should consider how to do shading while still being able to see the text.

I also think it might be nice to label the edges of the graph with the AA mutations. There is code to do this sort of thing in tabulate_mutations.py:

def find_muts(orig, mutated):
     return [
        "{}{}{}".format(o, idx+1, m)
        for idx, (o, m) in enumerate(zip(orig, mutated)) if o != m]

Test clock model

We should run a Bayes Factor test for strict vs uncorrelated lognormal clocks. I don't think that we have enough data to support a more complex model, but worth checking out.

BEAUTi to incorporate timepoint info

Now that we have timepoint-merged samples that contain up to four different timepoints for the sequences, should we try BEAUTi templates that incorporate timepoints? (instead of @dunleavy005 's chosen beast_template.xml that we're running now). We'd need to pull the timepoint information for each sequence from each sequence's name (or perhaps @metasoarous could advise us on yaml use for this purpose). @dunleavy005, thoughts?