This is the phylogeny pipeline, which builds single gene pyhlogenies for given bait sequences. With the bait sequences, the pipeline extracts matching sequences with BLAST from the protein sequence databases of NCBI and Uniprot. It extracts the sequences from NCBI remotely and from the Uniprot databases from local copies.
The pipeline requires to run:
- Standard Linux tools such as Bash
- SeqKit (user path)
- IQ-Tree2 (module load)
- Custom ete3 (installed locally from git clone) with Python 3
(ete3 requires QT5 with a running x-server, and thus was executed on a laptop)
- raxml-ng (user path)
- cd-hit (user path)
- efetch (user path)
- blastp (module load)
- TrimAl (base folder)
- RogueNaRok-parallel (base folder)
- TreeShrink (user path) with Python 2.7 (module load)
Supported aligners are:
- T-Coffee (module load)
- PASTA (user path)
- MUSCLE (module load)
- MAFFT (module load)
- FAMSA (user path)
- Clustal-Omega (module load)
Optional, however if not installed will generate an error:
- pdf2png (user path installed on a local laptop with apt)
The needed programms are installed in mainly three different ways, depending on the original needs.
- base folder is the folder where you clone this repository into.
- user path is where bash checks for programs installed,
this could be a system folder, but also a folder in your
home directory that was added to your search path.
- module load is the command used to load modules on a
cluster node into the environment. These are specified
in the .Scheduler/XX_* files. And need to be modified
for the cluster they are supposed to run on.
The pipeline is designed to run on a cluster computer, either with PBS Pro or Slurm as scheduler. In principle, more schedulers could be added. Another possibility would to install a scheduler such as PBS-Pro or Slurm on a local computer.
The phylogeny pipeline downloads the uniprot protein databases sprot and trembl in fasta format and builds from them blast databases, which requires about 210 GB as of August 2021.
If you want to use newer versions you have to delete them. The databases are in ./PhylogenyPipeline/ProteinDatabase/. There, just delete the folders uniprot_trembl and uniprot_sprot.
The phylogeny pipeline also downloads the taxon database from NCBI. If you want to use a newer version just delete it. The files are in ./PhylogenyPipeline/SpeciesDatabase/.
If you need to supply account information, when you start a job then go to file ./Scheduler/Account.sh and follow the comments given there. Note this is only implemented for Slurm, since no PBS-Pro is available for testing.