The Phylogeny Pipeline

This is the phylogeny pipeline, which builds single gene pyhlogenies for given bait sequences. With the bait sequences, the pipeline extracts matching sequences with BLAST from the protein sequence databases of NCBI and Uniprot. It extracts the sequences from NCBI remotely and from the Uniprot databases from local copies.

Prerequisites

The pipeline requires to run:

- Standard Linux tools such as Bash
- SeqKit (user path)
- IQ-Tree2 (module load)
- Custom ete3 (installed locally from git clone) with Python 3  
  (ete3 requires QT5 with a running x-server, and thus was executed on a laptop)
- raxml-ng (user path)
- cd-hit (user path)
- efetch (user path)
- blastp (module load)
- TrimAl (base folder)
- RogueNaRok-parallel (base folder)
- TreeShrink (user path) with Python 2.7 (module load)

Supported aligners are:

- T-Coffee (module load)
- PASTA (user path)
- MUSCLE (module load)
- MAFFT (module load)
- FAMSA (user path)
- Clustal-Omega (module load)

Optional, however if not installed will generate an error:

- pdf2png (user path installed on a local laptop with apt)

The needed programms are installed in mainly three different ways, depending on the original needs.

- base folder is the folder where you clone this repository into.
- user path is where bash checks for programs installed,  
  this could be a system folder, but also a folder in your  
  home directory that was added to your search path.
- module load is the command used to load modules on a  
  cluster node into the environment. These are specified  
  in the .Scheduler/XX_* files. And need to be modified  
  for the cluster they are supposed to run on.

The pipeline is designed to run on a cluster computer, either with PBS Pro or Slurm as scheduler. In principle, more schedulers could be added. Another possibility would to install a scheduler such as PBS-Pro or Slurm on a local computer.

Databases

The phylogeny pipeline downloads the uniprot protein databases sprot and trembl in fasta format and builds from them blast databases, which requires about 210 GB as of August 2021.

If you want to use newer versions you have to delete them. The databases are in ./PhylogenyPipeline/ProteinDatabase/. There, just delete the folders uniprot_trembl and uniprot_sprot.

The phylogeny pipeline also downloads the taxon database from NCBI. If you want to use a newer version just delete it. The files are in ./PhylogenyPipeline/SpeciesDatabase/.

User Account Information

If you need to supply account information, when you start a job then go to file ./Scheduler/Account.sh and follow the comments given there. Note this is only implemented for Slurm, since no PBS-Pro is available for testing.

martinguehmann / phylogenypipeline Goto Github PK

phylogenypipeline's Introduction

The Phylogeny Pipeline

Prerequisites

Databases

User Account Information

phylogenypipeline's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent