Giter Site home page Giter Site logo

naturalis / barcode-constrained-phylogeny Goto Github PK

View Code? Open in Web Editor NEW
1.0 9.0 3.0 3.48 MB

Pipeline for building topologically-constrained phylogenies from DNA barcode data

Home Page: https://naturalis.github.io/barcode-constrained-phylogeny/

License: Apache License 2.0

Python 97.47% Shell 2.53%
dna-barcoding metabarcoding phylogenetic-diversity phylogenetic-placement phylogenetics

barcode-constrained-phylogeny's People

Contributors

annemiekeschonthaler avatar jwesseling avatar n-scheffer avatar naomivanes avatar ni-s avatar rvosa avatar

Stargazers

 avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

barcode-constrained-phylogeny's Issues

Outgroup selection with blast

Plan:

  • progressively index unaligned.fa, only those sequences that have ott IDs with makeblastdb -in {unaligned.fa} -dbtype nucl -out {blastdb} -parse_seqids
  • when add outgroups, blast while ignoring the ingroup and only keep n outgroups, e.g. blastn -query {unaligned.fa} -db {blastdb} -out {outfile.txt} -outfmt 6 -max_target_seqs n -negative_seqidlist {exclude_ids.txt}
  • incorporate operations in snakefile
  • include outgroups in constraint tree

Barcode database has no indexes or primary keys

The file BOLD_COI-5P_barcodes.db has no indexes. For example, the column barcode.taxon_id should be indexed so that joins with taxon.taxon_id go MUCH faster. Also, taxon.taxon_id seems like it should be a primary key but it is not unique.

Submit to WorkflowHub

At time of writing, the pipeline appears technically ready so submission to WFH except for the testing, which might be doable by generating tests from the results (?)

Skip `temporary_database.db` step

A rule in the Snakefile has as an output target a file (temporary_database.db) that is renamed later on. Once that renaming has happened, the output target has disappeared and so snakemake decides to rerun everything to recover that target. This needs to be fixed by bypassing this renaming stuff and going straight for the final database file name.

Reroot unrooted raxml subtree

When the outgroup of the subtree is not monophyletic with respect to the ingroup, RAxML issues a warning but then merely writes out the unrooted tree. We can still root this with fairly defensible logic but this needs to be applied such that there is a tree file for the next steps to work with.

OpenToL 'broken' taxa

When requesting a subtree by its tip ott IDs, there are cases where one or more of the IDs are for species that OpenToL views as 'broken'. For example, Alouatta seniculus, or Callicebus personatus.

This is because at the subspecies level they are entangled with other species (respectively Alouatta sara and Callicebus coimbrai). In the requested subtree, such broken tips are only identified by a label of the form mrcaott126846ott126847, which means we can't immediately figure out which species triggered this.

The solution appears to be to:

  1. parse the label to get the IDs (126846 and 126847)
  2. requesting the induced subtree for those IDs
  3. fetching the tip labels

Whichever label corresponds for the first parts with a species in the input (unaligned.fa) that hasn't otherwise been seen in previous output should be the one to keep and to graft onto the tree.

Remove dashes (-) FASTA files

HMMer can’t align sequences which have dashes in them, remove the dashes before putting them into a FASTA file.

Change raxml prefix

To better organize the raxml ouput, add a directory for each family within the raxml/ directory

Rule family_fasta

Not sure how to run this one: for me it finishes in a second without extra output. Do I have to specify which family or sth?

Roadmap next release

  • refine input sequence curation as per BGE 10.1 criteria
  • make OpenTol coverage denser by ssp expansion/contraction
  • ensure outgroup selection avoids sequences within ingroup distances
  • test exemplar selection: tallest, shallowest, average
  • cleanup log file proliferation
  • run on Odonata, then Mammalia, then Lepidoptera
  • fix rooting/grafting
  • fix database import
  • finish workflow/documentation.md

RAxML placement and branch length optimization

It should be possible to use RAxML slightly different. Summarizing this thread, the idea would be to:

  1. Do phylogenetic placement of any family-level sequences that are not in the opentol
  2. Then do branch length optimization
    This obviates the need for outgroup selection and any ensuing rerooting issues.

MACSE to HMM

Hmm does not cut the headers short as MACSE did. Is it safe to delete replaceme alignment id script/rule?
Also the MACSE files are as of now still used in the snakemake file so this can be changed to hmm generated files. Wait untill we have a matk/rcbl model too?

Make families.txt's for both COI-5P familys and matK_rbcL

Some families are too large right now, we can specify which families (like all mammalia families) we want to use. Save a list with family names to family_COI-5P.txt and family_matK_rbcL.txt which can be used as wildcards in snakemake.

ERROR: Cannot find tree species

When making a subtree (create_tree.py) using a newick tree from OTL this error pops up for the first ott id (ott406633). The id is present in both the newick constraint tree and the input fasta file.

raxml with parwise alignments

raxml only takes msa file inputs, how the rule raxml is right now it will not work with the new hmm alignments.
Does it work on raxml branch with hmm alignments @NI-S ?

Correspondence between constraint tree and alignment

This is not so much an issue but an elaboration that perhaps should end up in various reports and documentation files. This is about how the tips in a constraint tree must logically correspond with the sequences in an alignment. When errors occur surrounding this topic, then this might be a reference in trying to resolve the issue. Here are some possible patterns in which tree and alignment relate to each other:

  1. There is no constraint tree at all. This is actually the most common way in which raxml is used. You could rephrase this situation as one where there is a constraint tree but it has zero tips. That may sound trivial but it's worth keeping in mind that the tool normally places sequences in a tree based only on the signal that's in the DNA so there's no need for any tips to be in a constraint.
  2. There's a constraint tree with some taxa from the alignment. In this case the constraint tree has a subset of the tips that are in the alignment. This is fine because it just means that there is a bit more signal besides the DNA, for example to make it so that a few tips are forced to be next to each other. This is basically what we're going for.
  3. The constraint tree has tips that are not in the alignment. This is impossible so it produces errors. The tree searcher is now being asked to place tips in a tree but it has no DNA data to do so. Tips in the constraint tree absent from the alignment must be pruned.
  4. The constraint tree or the alignment has multiple entries with the same name. This is also impossible. The solution here is to make the constraint tree have a polytomy (e.g. ((A1,A2,A3,A4),...)) and make the alignment also have A1, A2, A3 and A4 (instead of multiple, identical As).

I think that's all the possible arrangements :)

Documentation

I put some placeholders in the top-level README.md for bits of information that should be added. It would be good if that document could be updated. The idea is that this document is the 'landing page' for the project, so (potential) users will want to know quite a bit.

Use logging, not print statements

import logging

# Create a custom logger
logger = logging.getLogger(__name__)

logger.warning('This is a warning message')
logger.error('This is an error message')

Incompatible dependency graphs across rules

The failures in the github CI/CD workflow indicate that different tools end up resolving into different dependency versions (e.g. in libxml2). The way to address this would be to insulate the snakemake rules from each other and have them sit in separate conda environments or docker containers.

hmmalign parameterization

The current wrapper around hmmalign has resulted in unaligned sequences (e.g. in Naomi's report). It's possible that this is because the produced output, in Stockholm format, is not parsed and interpreted correctly. What looks less error prone is to use a different output format. For example, with hmmalign --outformat afa [options] [hmm] [query] the result is in aligned FASTA. A quick check with some variable sequences (quite distantly related primates, with COI sequences of different lengths) resulted in pretty good alignments nonetheless:

Screenshot 2023-11-08 at 16 58 19

The backbone logic

The following consideration factor into the inference of the backbone:

  • For each family F, the two tips F1, F2 that are most distant must be the exemplars for that family. This ensures that they will be on opposite sides of the root so that they form a fork Y in the backbone that can be replaced by the subtree for that family. The approach using the distance matrix is good for this. Within the constraint tree for the backbone, the two exemplars will be defined as a monophyletic group in the newick syntax, i.e. ...(F1,F2),....
  • For the relationships between the families (F,G,H,I), we want constraints from the OpenTree backbone. The challenge here is that F1 and F2 might not be in the OpenTree backbone. Instead, any single one of the tips from the families that did occur in there (e.g. any of the ones that went into the family-level constraint) will do. This might give a constraint like ((F,G),(H,I));. That constraint then needs to be updated so that for each family, the exemplars are grafted in, e.g. (((F1,F2),(G1,G2)),((H1,H2),(I1,I2)));.

Tip1: you can copy the statements that end with ;, such as (((F1,F2),(G1,G2)),((H1,H2),(I1,I2))); and paste them directly into figtree to see the tree. Do this to check the validity of the trees. They can't just be single Y splits.

Tip2: you can check the alignments in mesquite so that you see the amino acids, not the nucleotides, like so: https://www.youtube.com/watch?v=YsotPrE0KwE - do this to make sure the alignments are still OK.

Change alignment method

Use HMMer to align sequences to a COI-5P model with pairwise alignment for scalability. With the output of HMMer the sequences can be checked if its forward or reverse.

  1. Get refs.hmm from zip file https://github.com/psomervuo/FinPROTAX and save it in repo
  2. HMM align using profile and fasta file
  3. Use stockholm file to check which one of the is the forward sequence (most * in alignment)

Rough to do list/workflow

  • Make rough ERD
  • BOLD data dump -> SQLite (with marker code/kingdom filter as commandline input), do not index yet
  • Filter and rearrange data into ERD tables (barcode records -> atleast family taxonomy available, …)
  • Index data (unique ID for every barcode and taxon)
  • Map opentol ID to taxonomy table (with family name as constraint!)
  • Get barcode data into chunks (group by family name) and make per family a fasta file with barcodes (header with own DB id from barcode table) | OUTPUT FASTA?
  • Make alignment per family (MUSCLE? HMER? For HMER nt -> aa) | OUTPUT PYLIP?
  • Make tree per family/alignment, use opentol tree as constraint (get tree with opentol id) | INPUT PHYLIP & OUTPUT NEWICK?

Handling empty FASTA files

if file is empty in the results/fasta/family/ directory it will still be used and iterated over in the family fasta steps. Current fix: delete the file.

Open Tree of Life - extracting subtrees

Here is a SQLite representation of the opentree 13.4 supertree labelled with ott IDs. From this database, the DBTree toolkit can be used to extract subtrees with the megatree-pruner command. This will produce a newick tree. This tree still needs to be expanded so that tip labels that occur multiple times in the alignment are expanded into polytomies to disambiguate for raxml (see #27)

Add date/time stamp to log messages

import logging

# Create a logger
logger = logging.getLogger(__name__)

# Set the log level
logger.setLevel(logging.DEBUG)

# Create a console handler
handler = logging.StreamHandler()

# Define the format for the log messages,
# including the date and time stamp
log_format = '%(asctime)s - %(name)s - %(levelname)s - %(message)s'

# Set the date format
date_format = '%Y-%m-%d %H:%M:%S'

# Create a formatter using the specified format and date format
formatter = logging.Formatter(log_format, datefmt=date_format)

# Set the formatter for the handler
handler.setFormatter(formatter)

# Add the handler to the logger
logger.addHandler(handler)

# Example log message
logger.debug('This is a debug message.')

Picking exemplars

When picking the shallowest exemplars for the backbone, exemplar clades where shallow tips have near zero distance to their subtree root result in a backbone where nearly all change is allocated to the root branch of the of the exemplar clade. When grafting the subtrees onto the basal split, the subtrees become "too tall". Here's the grafted example:

Screenshot 2024-01-11 at 12 33 27

Here's the same topology but with the branch lengths re-estimated by raxml:

Screenshot 2024-01-11 at 12 34 17

A follow up experiment should be to pick the tallest exemplars instead. It's possible that this will make them "too shallow" instead, especially when there is a lot of homoplasy within the clade.

Additional indexes needed

To be able to generate area phylogenies, such as the tree for the Netherlands, the country field in the barcode should be indexed.

Check barcode length

Do a check for barcode length when making FASTA files (minimum 600 for COI-5P), make it adjustable in config.yaml

tree inference & data management

It is probably a good idea to run the steps incrementally, by which I mean: writing the alignments is a task that you can kill and then it resumes where it left of, until all families have an alignment. Same for the constraint trees: you probably want to have a tree for every alignment and store this in a folder structure. What you want to avoid is having to run everything in one giant loop that can never fail.

With that in mind, it's probably better to not have a temporary tree that has a throwaway name here, but rather have a system where the raxml run assumes that there is a constraint tree for every alignment (and they probably have the same name but with different extensions).

FIX: No constraint tree needed for families with one ott

Raxml gives an error when a family alignment with only one ott is being used (multiple barcodes from the same ott).
Change raxml rule from shell command line to python script that checks if a constraint tree newick file is empty first. If so run raxml without constraint tree, else with a constraint tree.

Data thinning

We don't need all sequences to end up on the tree. There are several reasons why data thinning would be a good idea:

  • unevenness in the tip density introduces spurious uncertainty when running pplacer (Lena's thesis)
  • data thinning uses less drive space
  • data thinning reduces the running time of the alignment and tree-building processes

On the other hand, the placement of a sequence and its branch length could be an indicator used in reference library
curation. A sequence with a suspiciously long terminal branch or a problematic placement (e.g. not next to another
BIN within the same species, or not in the expected place within the family) might be a misidentification or a
sequencing error. Hence, data thinning should be optional. If enabled, it should be something along the following lines:

  • one sequence >= config['minlength'] per BIN
  • probably the centroid or otherwise the most common sequence

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.