AnnotateSnakeMake is a Snakemake workflow routinely used in the Marlétaz lab to annotate genomes of echinoderm species. It takes as input a genome sequence (fasta) and RNA-seq datasets (fastq or fastq.gz, paired-end) and outputs the set of annotated protein-coding genes (bed, gtf and fasta files).
The pipeline uses Conda to deploy all of its dependencies automatically. The recommended way to install Snakemake is thus to follow the Conda/Mamba installation guidelines detailed in the Snakemake documentation.
We recommend creating a conda environment with snakemake 7.12 and tabulate 0.8:
mamba create -n snakemake -c bioconda -c conda-forge snakemake=7.12 tabulate=0.8
Clone the repository:
git clone https://github.com/eparey/AnnotateSnakeMake.git
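After creating the environment and cloning the repository, a quick sanity check can confirm that the expected commands are available on the PATH. The following is a minimal sketch; the helper name `require_tools` is hypothetical and not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of AnnotateSnakeMake): report which of the
# given commands cannot be found on the PATH.
require_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    # Return 0 when every command was found, 1 otherwise.
    return "$missing"
}

# Example, to be run with the 'snakemake' environment activated:
#   require_tools snakemake git && snakemake --version
```

If the check reports `missing: snakemake`, the environment is most likely not activated in the current shell.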
The following steps are run roughly sequentially (see the workflow image in images/workflow.pdf):
- repeat annotation and masking (RepeatModeler, RepeatMasker): workflow/rules/repeats.smk
- transcriptome assembly, de novo (Trinity, gmap): workflow/rules/transcriptome_de_novo.smk
- transcriptome assembly, genome-guided (STAR, STRINGTIE, TACO): workflow/rules/transcriptome_using_genome.smk
- consensus transcriptome (Mikado): workflow/rules/consensus_transcriptome.smk
- model training and gene prediction with AUGUSTUS (hints from Metaeuk, Mikado and Portcullis): workflow/rules/gene_pred.smk
- repeat filtering from gene models: workflow/rules/filter_repeats.smk
- gene model refinement with PASA: workflow/rules/gene_annot.smk
- evaluation of the annotation (BUSCO, PFAM), gene naming (DIAMOND, SWISSPROT) and file formatting: workflow/rules/final_models.smk
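To confirm that a clone contains all eight rule files listed above, something along these lines can be used. This is only a sketch; the function name `missing_rule_files` is hypothetical, and the file names are taken from the list above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the rule files from the list above that are
# absent from a checkout. The first argument is the path to the cloned
# repository; nothing is printed when all files are present.
missing_rule_files() {
    local repo="$1"
    local name
    for name in repeats transcriptome_de_novo transcriptome_using_genome \
                consensus_transcriptome gene_pred filter_repeats \
                gene_annot final_models; do
        if [ ! -f "$repo/workflow/rules/$name.smk" ]; then
            echo "workflow/rules/$name.smk"
        fi
    done
}

# Example:
#   missing_rule_files AnnotateSnakeMake
```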
The test set serves as an example for input data specification and formatting (see config/config.yaml and resources/).
Before running, users need to download and format the Pfam database and specify its location in config/config.yaml (please see the download instructions at https://github.com/aziele/pfam_scan, 'Prepare library of Pfam HMMs').
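Before pointing the config at the Pfam directory, it can help to check that the directory looks complete. The sketch below assumes the layout that the linked instructions are expected to produce: the `Pfam-A.hmm` and `Pfam-A.hmm.dat` files plus the four index files written by `hmmpress`. The function name `pfam_dir_ready` is hypothetical, and the file list is an assumption rather than something this repository specifies.

```shell
#!/usr/bin/env bash
# Hypothetical helper: check that a directory holds a prepared Pfam library.
# Assumed layout (not confirmed by this repository): the HMM file, its .dat
# companion, and the .h3f/.h3i/.h3m/.h3p indices created by hmmpress.
pfam_dir_ready() {
    local dir="$1"
    local status=0
    local f
    for f in Pfam-A.hmm Pfam-A.hmm.dat \
             Pfam-A.hmm.h3f Pfam-A.hmm.h3i Pfam-A.hmm.h3m Pfam-A.hmm.h3p; do
        # -s is true only for files that exist and are not empty.
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f"
            status=1
        fi
    done
    return "$status"
}

# Example (the path is a placeholder):
#   pfam_dir_ready /path/to/pfam && echo "Pfam library looks complete"
```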
- The pipeline has to be run from the AnnotateSnakeMake folder with the snakemake environment activated (see above to create the 'snakemake' conda env):
cd AnnotateSnakeMake
conda activate snakemake
- To print a dry-run:
snakemake --configfile=config/config.yaml --use-conda --cores 48 -n
- To run on the test set:
snakemake --configfile=config/config.yaml --use-conda --cores 48
- To run on your own data, first create a specific directory for the run:
mkdir ../Afil_annotation
- Create a configuration file. We recommend copying and editing the example config:
cp config/config.yaml ../Afil_annotation/config_afil.yaml
Note that the relative paths in the example config should be replaced by absolute paths pointing to the user-provided data.
- Run the pipeline, still from within the AnnotateSnakeMake/ folder but providing the run directory with --directory:
conda activate snakemake
snakemake --configfile ../Afil_annotation/config_afil.yaml --directory ../Afil_annotation/ --use-conda --cores 48
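Since the run uses a separate run directory, it can be convenient to resolve the config file and run directory to absolute paths before launching. The following sketch prints the resulting command rather than executing it. The helper names `abs_path` and `annotation_cmd` are hypothetical, the default of 48 cores simply mirrors the commands above, and paths are assumed to contain no whitespace.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of AnnotateSnakeMake.

# Resolve a path to an absolute one using only cd/pwd; the parent directory
# must exist. Assumes the path contains no whitespace-sensitive usage later.
abs_path() {
    local dir base
    dir=$(cd "$(dirname "$1")" && pwd) || return 1
    base=$(basename "$1")
    echo "$dir/$base"
}

# Print the snakemake command for a config file ($1), a run directory ($2)
# and an optional number of cores ($3, defaulting to 48 as in the examples).
annotation_cmd() {
    local config run_dir cores
    config=$(abs_path "$1") || return 1
    run_dir=$(abs_path "$2") || return 1
    cores="${3:-48}"
    echo "snakemake --configfile $config --directory $run_dir --use-conda --cores $cores"
}

# Example, from within the AnnotateSnakeMake/ folder:
#   annotation_cmd ../Afil_annotation/config_afil.yaml ../Afil_annotation/
```

Printing the command first makes it easy to inspect the resolved paths, and appending `-n` to the printed command gives a dry-run before committing to the full run.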
The workflow was primarily deposited for reproducibility purposes. If you are interested in running it and encounter issues, please do not hesitate to contact us; we are happy to help.
We also thank Chema Martin-Duran for contributing to the development of the pipeline.
The annotation workflow is described in the methods section of the brittle star Amphiura filiformis genome paper:
Parey, E. et al. The brittle star genome illuminates the genetic basis of animal appendage regeneration. Nat Ecol Evol (2024). https://doi.org/10.1038/s41559-024-02456-y