gjospin / phylosift

Phylogenetic and taxonomic analysis for genomes and metagenomes

Perl 80.20% Shell 0.01% CSS 0.07% Python 0.14% R 1.04% TeX 18.54%

phylosift's People

ryneches wwood hyphaltip mlangill matsen luhugerth fw1121 scwatts ofanoyi nompumelelo7 hrk2109 dry-lab gsc0107 lrf2019 khemlalnirmalkar wook2014

phylosift's Issues

Download Tree - 18S euks

Get representative sequences - 18S euk

Easy way to calculate KR distances among samples with guppy

guppy can calculate the Kantorovich-Rubinstein (KR) distance between a pair of pplacer placement files. These distances are useful for assessing how similar two microbial communities are to each other. amphora2 should have a way to calculate these distances among a group of samples using the protein marker families.

Retrieve model data from SSU align - Archaeal 16S

Include all protein families as another amphora2 database

The set of 350k protein families needs FastTrees with logs and pplacer reference packages created for them. These should go into a separate marker database since it's likely to be huge and not everybody will want it.

SOP and Tutorials for using PhyloSift

Calculate edge PCA in amphora2

Based on KR distances among samples, one can calculate an edge principal components analysis which, among other things, can highlight which groups of organisms contribute most to the differences between microbial communities. amphora2 should have an easy way to invoke guppy to do this for a group of samples.

Implement pplacer to return a jplace file from aligned environmental sequences

Build de novo phylogeny in FastTree - 16S bac

FastTree -nt -gtr -log logname

Cleanup garbage outputs (set debug false)

Tree reconciliation between marker genes and rRNA data (find tool)

Take two jplace files and merge them into one?
Build a protein tree that contains everything in the 16S database but is constrained to have the relationships in the 16S database?

Investigate mitochondrial markers?

Creating automated test cases for stable release versions

Download unaligned sequences - 16S arch/bac

Download unaligned sequences - 18S euks

unit tests for perl functions > 1 month old

All non-trivial, stable functions should have unit tests associated with them.

Prune Reference Tree - 16S arch/bac

Prune Reference Tree - 18S euks

Align representative sequences - 16S arch/bac

Write perldoc for anything >1 month old

Build de novo phylogeny in FastTree - 18S euks

FastTree -nt -gtr -log logname

join paired reads in concatenate output

When a single long read, or the two reads of a paired-end fragment, hit multiple markers, the hits are not treated as a single sequence in the concatenated alignment that is produced. This needlessly discards valuable linkage information.
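One way to keep that linkage is to group marker hits by the fragment they came from before concatenating. The sketch below is only an illustration of that idea, not PhyloSift's implementation: the `/1` and `/2` mate suffixes, the `(read_id, marker, aligned_seq)` tuple layout, and the function names are all assumptions.

```python
from collections import defaultdict

def pair_key(read_id):
    """Strip a trailing /1 or /2 mate suffix (an assumed naming convention)."""
    if read_id.endswith(("/1", "/2")):
        return read_id[:-2]
    return read_id

def group_hits_by_fragment(hits):
    """Group (read_id, marker, aligned_seq) tuples by source fragment.

    The tuple layout is hypothetical. If both mates hit the same marker,
    the later hit overwrites the earlier one in this simple sketch.
    """
    grouped = defaultdict(dict)
    for read_id, marker, aligned_seq in hits:
        grouped[pair_key(read_id)][marker] = aligned_seq
    return dict(grouped)

# Toy hits: both mates of frag7 hit different markers.
hits = [
    ("frag7/1", "markerA", "MKV-L"),
    ("frag7/2", "markerB", "GT-PA"),
    ("frag9/1", "markerA", "MRV-L"),
]
fragments = group_hits_by_fragment(hits)
```

Each entry of `fragments` could then be written as one row of the concatenated alignment, with gaps for markers the fragment did not hit.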

BlastX-oof is too noisy (redirect to dev/null)

Identify eukaryotic genome data for building reference collection

Identify and describe file formats for compatibility with QIIME

OTU tables
FASTQ format for OTUs (labelled headers)
Tree for unifrac distances

Build database of 16S sequences

Pull down data from Greengenes database

Separate out Bacteria and Archaea into separate databases

Filter and process

Develop an auto update method for rRNA reference packages

E.g. don't add anything new unless it is more than 1% different from everything else already in the tree.
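A minimal sketch of such a novelty filter follows. It assumes the candidate and reference sequences are already aligned to the same length and uses a plain p-distance (fraction of differing, non-gap columns) as the measure of "different"; the issue does not say which distance is intended, and the function names are hypothetical.

```python
def p_distance(a, b):
    """Fraction of differing positions between two equal-length aligned sequences.

    Columns where either sequence has a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        return 1.0  # nothing comparable: treat as maximally different
    return sum(x != y for x, y in compared) / len(compared)

def is_novel(candidate, references, min_distance=0.01):
    """True if the candidate differs by more than min_distance from every reference."""
    return all(p_distance(candidate, ref) > min_distance for ref in references)

ref = "ACGT" * 50          # 200 aligned columns
near = "T" + ref[1:]       # 1 difference in 200 columns (0.5%)
far = "TTTT" + ref[4:]     # 3 differences in 200 columns (1.5%)
```

With these toy sequences, `is_novel(near, [ref])` is false and `is_novel(far, [ref])` is true, so only `far` would be added to the reference package.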

"recursive" processing of reads in well-sampled parts of the tree

Reads that are very close to a reference sequence would be placed more accurately using their DNA sequence. We should identify these reads/sequences on the basis of rapsearch/blast output and flag them for DNA analysis instead of protein analysis.

Related to this, we will need the updating script to divide up the database into subsets of taxa that are similar enough for nucleotide analysis. Leaving them lumped together has been tried and results in poor quality inference -- the phylogenetic models get confused by the extensive diversity at any particular site. An alternative would be codon analysis, but no known read placement tool supports this.
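The flagging step could look roughly like the sketch below. It assumes each search hit has already been reduced to a `(read_id, percent_identity)` pair; the 95% cutoff and the function name are placeholders, since the issue does not specify how "very close" should be defined.

```python
def split_by_identity(hits, threshold=95.0):
    """Split read IDs into (dna, protein) lists by best-hit percent identity.

    hits: iterable of (read_id, percent_identity) pairs (hypothetical layout).
    threshold: assumed cutoff; reads at or above it are flagged for DNA analysis.
    """
    dna, protein = [], []
    for read_id, pct_identity in hits:
        if pct_identity >= threshold:
            dna.append(read_id)
        else:
            protein.append(read_id)
    return dna, protein

dna_reads, protein_reads = split_by_identity(
    [("r1", 99.2), ("r2", 71.0), ("r3", 95.0)]
)
```

Reads in `dna_reads` would then go to the nucleotide reference subsets described above, and the rest would stay in the protein workflow.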

Output summary on Greengenes taxonomy - text

Implement a buildbot with tests for PhyloSift

buildbot is awesome:

http://buildbot.net/buildbot/docs/current/tutorial/

Filling out reference marker gene sets directly from metagenome data

Identify where we can get metagenomes from public databases
Script that downloads these files
Feed to MetAMOS or another metagenome assembler
Take all the marker gene hits and put these into marker gene database

Build de novo phylogeny in FastTree - 16S arch

FastTree -nt -gtr -log logname

Create reference rRNA marker package with Taxit - 16S bac

Align representative sequences - 18S euks

draft a manuscript describing the method

Phylogenetic analysis of genomes and metagenomes for the 1%.

Output summary on Greengenes taxonomy - visual (Krona)

PD pruning during marker database update

Currently the marker databases include sequences from all genomes. Many of these are 100% identical to each other and don't offer any additional phylogenetic resolution. Including these sequences makes the marker gene database large (currently > 1 GB). Phylogenetic diversity (PD) pruning could be used to keep only the subset of sequences that are most informative.
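As a first step before any tree-based pruning, the 100% identical sequences can be collapsed to one representative each. Note that the sketch below is plain exact-duplicate removal, not PD pruning itself, and the `(name, sequence)` record layout and function name are assumptions.

```python
def collapse_identical(records):
    """Collapse exact-duplicate sequences.

    records: list of (name, sequence) pairs.
    Returns a list of (representative_name, sequence, duplicate_names),
    keeping the first name seen for each distinct sequence.
    """
    names_by_seq = {}
    for name, seq in records:
        names_by_seq.setdefault(seq, []).append(name)
    return [(names[0], seq, names[1:]) for seq, names in names_by_seq.items()]

records = [
    ("genomeA", "MKVLAT"),
    ("genomeB", "MKVLAT"),   # identical to genomeA
    ("genomeC", "MKILAT"),
]
collapsed = collapse_identical(records)
```

Keeping the list of duplicate names makes it possible to map a placement on the representative back to every genome that shares the sequence.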

Compare Dongying's euk marker genes with Parfrey phylogeny

design and construct test datasets

Ideally there would be a script to generate test datasets from genome sequence data.
The classic approach is to take reads from isolate genomes and mix them in known abundances.
We need to get isolate genome reads.
Design microbial communities using them.
Aaron has some of these scripts already.

It would also be good to construct in vitro simulations.
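The mixing step of the classic approach can be sketched as follows. This is an illustrative sketch only, not one of the existing scripts mentioned above: the function name, the dictionary layout, and the fixed seed are assumptions. Reads are sampled with replacement, and each one is labelled with its source genome so the true composition stays known.

```python
import random

def mix_reads(reads_by_genome, abundances, total_reads, seed=0):
    """Draw a mock community of reads at the given relative abundances.

    reads_by_genome: dict mapping genome name to a list of read sequences.
    abundances: dict mapping genome name to a relative weight (need not sum to 1).
    Returns a list of (genome, read) pairs of length total_reads.
    """
    rng = random.Random(seed)  # fixed seed so the test dataset is reproducible
    genomes = sorted(reads_by_genome)
    weights = [abundances[g] for g in genomes]
    picks = rng.choices(genomes, weights=weights, k=total_reads)
    return [(g, rng.choice(reads_by_genome[g])) for g in picks]

isolate_reads = {"gA": ["ACGT", "GGCC"], "gB": ["TTAA"]}
community = mix_reads(isolate_reads, {"gA": 1.0, "gB": 0.0}, total_reads=20)
```

Because the seed is fixed, repeated runs give the same dataset, which is convenient when the output of an analysis run is compared against the known abundances.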
