naturalis / barcode-constrained-phylogeny Goto Github PK

View Code? Open in Web Editor NEW

1.0 9.0 3.0 3.48 MB

Pipeline for building topologically-constrained phylogenies from DNA barcode data

Home Page: https://naturalis.github.io/barcode-constrained-phylogeny/

License: Apache License 2.0

Python 97.47% Shell 2.53%

dna-barcoding metabarcoding phylogenetic-diversity phylogenetic-placement phylogenetics

barcode-constrained-phylogeny's People

Contributors

Stargazers

Watchers

Forkers

pecholleyn bge-bioscan deyuanyang

barcode-constrained-phylogeny's Issues

Spurious pandas dependency errors

Sometimes snakemake claims pandas is not installed. Current fix: sudo apt-get install python3 pandas (in the terminal).

Outgroup selection with blast

Plan:

progressively index unaligned.fa, only those sequences that have ott IDs with makeblastdb -in {unaligned.fa} -dbtype nucl -out {blastdb} -parse_seqids
when add outgroups, blast while ignoring the ingroup and only keep n outgroups, e.g. blastn -query {unaligned.fa} -db {blastdb} -out {outfile.txt} -outfmt 6 -max_target_seqs n -negative_seqidlist {exclude_ids.txt}
incorporate operations in snakefile
include outgroups in constraint tree

Barcode database has no indexes or primary keys

The file BOLD_COI-5P_barcodes.db has no indexes. For example, the column barcode.taxon_id should be indexed so that joins with taxon.taxon_id go MUCH faster. Also, taxon.taxon_id seems like it should be a primary key but it is not unique.

Checklist bank request

Worked last week, now got a "HTTP 405 Method Not Allowed" message.

Submit to WorkflowHub

At time of writing, the pipeline appears technically ready so submission to WFH except for the testing, which might be doable by generating tests from the results (?)

Skip `temporary_database.db` step

A rule in the Snakefile has as an output target a file (temporary_database.db) that is renamed later on. Once that renaming has happened, the output target has disappeared and so snakemake decides to rerun everything to recover that target. This needs to be fixed by bypassing this renaming stuff and going straight for the final database file name.

Reroot unrooted raxml subtree

When the outgroup of the subtree is not monophyletic with respect to the ingroup, RAxML issues a warning but then merely writes out the unrooted tree. We can still root this with fairly defensible logic but this needs to be applied such that there is a tree file for the next steps to work with.

OpenToL 'broken' taxa

When requesting a subtree by its tip ott IDs, there are cases where one or more of the IDs are for species that OpenToL views as 'broken'. For example, Alouatta seniculus, or Callicebus personatus.

This is because at the subspecies level they are entangled with other species (respectively Alouatta sara and Callicebus coimbrai). In the requested subtree, such broken tips are only identified by a label of the form mrcaott126846ott126847, which means we can't immediately figure out which species triggered this.

The solution appears to be to:

parse the label to get the IDs (126846 and 126847)
requesting the induced subtree for those IDs
fetching the tip labels

Whichever label corresponds for the first parts with a species in the input (unaligned.fa) that hasn't otherwise been seen in previous output should be the one to keep and to graft onto the tree.

Documentation README /src

Explain each rule from snakefile_phylogeny in readme

Remove dashes (-) FASTA files

HMMer can’t align sequences which have dashes in them, remove the dashes before putting them into a FASTA file.

Change raxml prefix

To better organize the raxml ouput, add a directory for each family within the raxml/ directory

Rule family_fasta

Not sure how to run this one: for me it finishes in a second without extra output. Do I have to specify which family or sth?

OPTION: Integrate perl tool db into custom db

Feedback Rutger: The database from perl-bio-phylo-forest-dbtree (rule megatree-loader) could be integrated into our own custom db (both use Opentol ID's).

Add different Raxml command for AA alignments

Raxml needs a different commandline/model to work with AA sequences.

Rewrite into snakemake pipeline

Rewrite code to be able to run it with a snakemake file. See closed issue #23 for details

Some bold entries do not have a species name (retrieve_barcody.py)

Some queried data from BOLD systems do not contain a species name (or are not on species level).

Do we want those in a later stadium? If so an alternative name needs to be made so that the FASTA header is usable.

Roadmap next release

refine input sequence curation as per BGE 10.1 criteria
make OpenTol coverage denser by ssp expansion/contraction
ensure outgroup selection avoids sequences within ingroup distances
test exemplar selection: tallest, shallowest, average
cleanup log file proliferation
run on Odonata, then Mammalia, then Lepidoptera
fix rooting/grafting
fix database import
finish workflow/documentation.md

RAxML placement and branch length optimization

It should be possible to use RAxML slightly different. Summarizing this thread, the idea would be to:

Do phylogenetic placement of any family-level sequences that are not in the opentol
Then do branch length optimization
This obviates the need for outgroup selection and any ensuing rerooting issues.

MACSE to HMM

Hmm does not cut the headers short as MACSE did. Is it safe to delete replaceme alignment id script/rule?
Also the MACSE files are as of now still used in the snakemake file so this can be changed to hmm generated files. Wait untill we have a matk/rcbl model too?

Make families.txt's for both COI-5P familys and matK_rbcL

Some families are too large right now, we can specify which families (like all mammalia families) we want to use. Save a list with family names to family_COI-5P.txt and family_matK_rbcL.txt which can be used as wildcards in snakemake.

ERROR: Cannot find tree species

When making a subtree (create_tree.py) using a newick tree from OTL this error pops up for the first ott id (ott406633). The id is present in both the newick constraint tree and the input fasta file.

Unit tests

The pipeline needs to be testable so that the CI action can run the tests that are now commented out. This can be accomplished along these lines.

raxml with parwise alignments

raxml only takes msa file inputs, how the rule raxml is right now it will not work with the new hmm alignments.
Does it work on raxml branch with hmm alignments @NI-S ?

Correspondence between constraint tree and alignment

This is not so much an issue but an elaboration that perhaps should end up in various reports and documentation files. This is about how the tips in a constraint tree must logically correspond with the sequences in an alignment. When errors occur surrounding this topic, then this might be a reference in trying to resolve the issue. Here are some possible patterns in which tree and alignment relate to each other:

There is no constraint tree at all. This is actually the most common way in which raxml is used. You could rephrase this situation as one where there is a constraint tree but it has zero tips. That may sound trivial but it's worth keeping in mind that the tool normally places sequences in a tree based only on the signal that's in the DNA so there's no need for any tips to be in a constraint.
There's a constraint tree with some taxa from the alignment. In this case the constraint tree has a subset of the tips that are in the alignment. This is fine because it just means that there is a bit more signal besides the DNA, for example to make it so that a few tips are forced to be next to each other. This is basically what we're going for.
The constraint tree has tips that are not in the alignment. This is impossible so it produces errors. The tree searcher is now being asked to place tips in a tree but it has no DNA data to do so. Tips in the constraint tree absent from the alignment must be pruned.
The constraint tree or the alignment has multiple entries with the same name. This is also impossible. The solution here is to make the constraint tree have a polytomy (e.g. ((A1,A2,A3,A4),...)) and make the alignment also have A1, A2, A3 and A4 (instead of multiple, identical As).

I think that's all the possible arrangements :)

Documentation

I put some placeholders in the top-level README.md for bits of information that should be added. It would be good if that document could be updated. The idea is that this document is the 'landing page' for the project, so (potential) users will want to know quite a bit.

Use logging, not print statements

import logging

# Create a custom logger
logger = logging.getLogger(__name__)

logger.warning('This is a warning message')
logger.error('This is an error message')

add user arguments to script retrieve_bold_data.py

Make it possible to for user to give arguments (like custom directory names) when calling script through commandline.

Incompatible dependency graphs across rules

The failures in the github CI/CD workflow indicate that different tools end up resolving into different dependency versions (e.g. in libxml2). The way to address this would be to insulate the snakemake rules from each other and have them sit in separate conda environments or docker containers.

COI-5P filtering in API request not working (retrieve_bold.py)

Other markers are kept in the records. At this point it is filtered using the pandas dataframe. Only 1536 barcodes remain (needs to be looked at if it is correct).

Implementation status beyond raxml

The current implementation runs until the raxml target. At present, all subsequent steps need to be done manually.

hmmalign parameterization

The current wrapper around hmmalign has resulted in unaligned sequences (e.g. in Naomi's report). It's possible that this is because the produced output, in Stockholm format, is not parsed and interpreted correctly. What looks less error prone is to use a different output format. For example, with hmmalign --outformat afa [options] [hmm] [query] the result is in aligned FASTA. A quick check with some variable sequences (quite distantly related primates, with COI sequences of different lengths) resulted in pretty good alignments nonetheless:

The backbone logic

The following consideration factor into the inference of the backbone:

For each family F, the two tips F1, F2 that are most distant must be the exemplars for that family. This ensures that they will be on opposite sides of the root so that they form a fork Y in the backbone that can be replaced by the subtree for that family. The approach using the distance matrix is good for this. Within the constraint tree for the backbone, the two exemplars will be defined as a monophyletic group in the newick syntax, i.e. ...(F1,F2),....
For the relationships between the families (F,G,H,I), we want constraints from the OpenTree backbone. The challenge here is that F1 and F2 might not be in the OpenTree backbone. Instead, any single one of the tips from the families that did occur in there (e.g. any of the ones that went into the family-level constraint) will do. This might give a constraint like ((F,G),(H,I));. That constraint then needs to be updated so that for each family, the exemplars are grafted in, e.g. (((F1,F2),(G1,G2)),((H1,H2),(I1,I2)));.

Tip1: you can copy the statements that end with ;, such as (((F1,F2),(G1,G2)),((H1,H2),(I1,I2))); and paste them directly into figtree to see the tree. Do this to check the validity of the trees. They can't just be single Y splits.

Tip2: you can check the alignments in mesquite so that you see the amino acids, not the nucleotides, like so: https://www.youtube.com/watch?v=YsotPrE0KwE - do this to make sure the alignments are still OK.

Change alignment method

Use HMMer to align sequences to a COI-5P model with pairwise alignment for scalability. With the output of HMMer the sequences can be checked if its forward or reverse.

Get refs.hmm from zip file https://github.com/psomervuo/FinPROTAX and save it in repo
HMM align using profile and fasta file
Use stockholm file to check which one of the is the forward sequence (most * in alignment)

Rough to do list/workflow

Make rough ERD
BOLD data dump -> SQLite (with marker code/kingdom filter as commandline input), do not index yet
Filter and rearrange data into ERD tables (barcode records -> atleast family taxonomy available, …)
Index data (unique ID for every barcode and taxon)
Map opentol ID to taxonomy table (with family name as constraint!)
Get barcode data into chunks (group by family name) and make per family a fasta file with barcodes (header with own DB id from barcode table) | OUTPUT FASTA?
Make alignment per family (MUSCLE? HMER? For HMER nt -> aa) | OUTPUT PYLIP?
Make tree per family/alignment, use opentol tree as constraint (get tree with opentol id) | INPUT PHYLIP & OUTPUT NEWICK?

Handling empty FASTA files

if file is empty in the results/fasta/family/ directory it will still be used and iterated over in the family fasta steps. Current fix: delete the file.

Create v1.1.0

Open Tree of Life - extracting subtrees

Here is a SQLite representation of the opentree 13.4 supertree labelled with ott IDs. From this database, the DBTree toolkit can be used to extract subtrees with the megatree-pruner command. This will produce a newick tree. This tree still needs to be expanded so that tip labels that occur multiple times in the alignment are expanded into polytomies to disambiguate for raxml (see #27)

Add date/time stamp to log messages

import logging

# Create a logger
logger = logging.getLogger(__name__)

# Set the log level
logger.setLevel(logging.DEBUG)

# Create a console handler
handler = logging.StreamHandler()

# Define the format for the log messages,
# including the date and time stamp
log_format = '%(asctime)s - %(name)s - %(levelname)s - %(message)s'

# Set the date format
date_format = '%Y-%m-%d %H:%M:%S'

# Create a formatter using the specified format and date format
formatter = logging.Formatter(log_format, datefmt=date_format)

# Set the formatter for the handler
handler.setFormatter(formatter)

# Add the handler to the logger
logger.addHandler(handler)

# Example log message
logger.debug('This is a debug message.')

macse in bin/

as per Noble, 2009

ERROR: Github when running a 'build-linux' job after pushing branch

Error alignment_output = snakemake.output[0]
^
4 F821 undefined name 'snakemake'

Does not recognize snakemake in the python files? Has to do with conda env
https://github.com/naturalis/barcode-constrained-phylogeny/actions/runs/4989555728/jobs/8934184582

create_database produces inconsistencies

taxa and barcodes need to be loaded differently, review

Picking exemplars

When picking the shallowest exemplars for the backbone, exemplar clades where shallow tips have near zero distance to their subtree root result in a backbone where nearly all change is allocated to the root branch of the of the exemplar clade. When grafting the subtrees onto the basal split, the subtrees become "too tall". Here's the grafted example:

Here's the same topology but with the branch lengths re-estimated by raxml:

A follow up experiment should be to pick the tallest exemplars instead. It's possible that this will make them "too shallow" instead, especially when there is a lot of homoplasy within the clade.

Additional indexes needed

To be able to generate area phylogenies, such as the tree for the Netherlands, the country field in the barcode should be indexed.

Put ott_id in fasta header in retrieve_bold.py

To account for the name clipping of the header during in MUCLE alignment output, change the FASTA header to ott_id (OpenTol taxnomic id) while matching with OpenTol.

family_fasta.py feature needed after bugfix map_opentol.py

Need to select only barcodes with an opentol ID mapping. Right now this is not the case because it only has a very small percentage of mappings because of a request error in map_opentol.py

Release generated files using Zenodo

Check barcode length

Do a check for barcode length when making FASTA files (minimum 600 for COI-5P), make it adjustable in config.yaml

tree inference & data management

It is probably a good idea to run the steps incrementally, by which I mean: writing the alignments is a task that you can kill and then it resumes where it left of, until all families have an alignment. Same for the constraint trees: you probably want to have a tree for every alignment and store this in a folder structure. What you want to avoid is having to run everything in one giant loop that can never fail.

With that in mind, it's probably better to not have a temporary tree that has a throwaway name here, but rather have a system where the raxml run assumes that there is a constraint tree for every alignment (and they probably have the same name but with different extensions).

FIX: No constraint tree needed for families with one ott

Raxml gives an error when a family alignment with only one ott is being used (multiple barcodes from the same ott).
Change raxml rule from shell command line to python script that checks if a constraint tree newick file is empty first. If so run raxml without constraint tree, else with a constraint tree.

Data thinning

We don't need all sequences to end up on the tree. There are several reasons why data thinning would be a good idea:

unevenness in the tip density introduces spurious uncertainty when running pplacer (Lena's thesis)
data thinning uses less drive space
data thinning reduces the running time of the alignment and tree-building processes

On the other hand, the placement of a sequence and its branch length could be an indicator used in reference library
curation. A sequence with a suspiciously long terminal branch or a problematic placement (e.g. not next to another
BIN within the same species, or not in the expected place within the family) might be a misidentification or a
sequencing error. Hence, data thinning should be optional. If enabled, it should be something along the following lines:

one sequence >= config['minlength'] per BIN
probably the centroid or otherwise the most common sequence