stag-mwc's Introduction

StaG Metagenomic Workflow Collaboration (mwc)

StaG mwc logo

The StaG Metagenomic Workflow Collaboration (mwc) project focuses on providing a metagenomics analysis workflow suitable for microbiome research and general metagenomics analyses.

Please visit https://stag-mwc.readthedocs.io for the full documentation.

Usage

Step 0: Install conda and Snakemake

Conda and Snakemake are required to run StaG-mwc. Most users will want to install Miniconda and install Snakemake into their base environment. When running StaG-mwc with the --use-conda or --use-singularity flag, all dependencies are managed automatically: with conda, the required versions of all tools needed to run StaG-mwc are installed automatically. There is no need to combine the conda and singularity flags, as the Singularity images used by the workflow already contain all required dependencies.
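
For example (one common way, assuming a Miniconda installation with the conda-forge and bioconda channels available), Snakemake can be installed into the base environment with:

conda install -c conda-forge -c bioconda snakemake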

Step 1: Clone workflow

To use StaG-mwc, you need a local copy of the workflow repository. Start by making a clone of the repository:

git clone git@github.com:ctmrbio/stag-mwc

If you use StaG-mwc in a publication, please credit the authors by citing either the URL of this repository or the project's DOI. Also, don't forget to cite the publications of the other tools used in your workflow.

Step 2: Configure workflow

Configure the workflow according to your needs by editing the file config/config.yaml. The most common changes include setting the paths to input and output folders, and selecting which steps to include when running the workflow.

Step 3: Execute workflow

Test your configuration by performing a dry-run via

snakemake --use-conda -n

Execute the workflow locally via

snakemake --use-conda --cores N

This will run the workflow locally using N cores. It is also possible to run it in a cluster environment by using one of the available profiles, or creating your own, e.g. to run on CTMR's Gandalf cluster:

snakemake --profile profiles/ctmr_gandalf

Make sure you specify the Slurm account and partition in profiles/ctmr_gandalf/config.yaml. Refer to the official Snakemake documentation for further details on how to run Snakemake workflows on other types of cluster resources.
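
As a purely hypothetical sketch of where these settings typically live (the real keys in profiles/ctmr_gandalf/config.yaml may differ), a Slurm-based Snakemake profile often looks something like this:

# profiles/ctmr_gandalf/config.yaml (hypothetical sketch, not the actual file)
jobs: 100
use-conda: true
cluster: "sbatch --account=YOUR_ACCOUNT --partition=YOUR_PARTITION --cpus-per-task={threads}"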

Note that in all examples above, --use-conda can be replaced with --use-singularity to run in Singularity containers instead of using a locally installed conda. Read more about it under the Running section in the docs.

Testing

A very basic continuous integration test is currently in place. It merely validates the syntax by letting Snakemake build the dependency graph with all outputs activated. Suggestions for how to improve the automated testing of StaG-mwc are very welcome!

Contributing

Refer to the contributing guidelines in CONTRIBUTING.md for instructions on how to contribute to StaG-mwc.

If you intend to modify or further develop this workflow, you are welcome to fork this repository. Please consider sharing potential improvements via a pull request.

Citing

If you find StaG-mwc useful in your research, please cite the Zenodo DOI: https://zenodo.org/badge/latestdoi/125840716

Logo attribution

Animal vector created by Patrickss - Freepik.com

stag-mwc's People

Contributors

aroarz, boulund, jwdebelius, lis4matilda

stag-mwc's Issues

Add note about conda prefix dir to docs

In order to reduce the number of identical conda environments lying around when running StaG-mwc with the --use-conda flag out of several working directories, it could be useful to know about the existence of the --conda-prefix DIR argument to snakemake. It will then create and use workflow-related conda environments out of DIR, instead of putting them in .snakemake/conda in the current directory. This reduces unnecessary conda environment redundancy and overall space consumption. Conda hashes the environment build, so there's no risk of accidentally using the wrong environments anyway.
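
For example (paths and core count are placeholders):

snakemake --use-conda --conda-prefix /path/to/shared/conda_envs --cores 8

This way, all working directories share a single set of conda environments.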

Add BBCountUnique

I want to add an additional read pre-processing step that aims to assess the sequencing depth of a sample by counting the number of unique observed k-mers in the data. The BBMap suite has a tool that does this, bbcountunique.sh, and I already have a Python script that produces plots from the data that I've used in another workflow.
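
A rough sketch of what such a rule could look like (rule name, output path, and the use of only read 1 are assumptions, not a final implementation):

rule bbcountunique:
    """Assess sequencing depth by counting unique k-mers with bbcountunique.sh."""
    input:
        read1=OUTDIR/"filtered_human/{sample}_R1.filtered_human.fq.gz",
    output:
        histogram=OUTDIR/"bbcountunique/{sample}.unique_kmers.txt",
    conda: "../../envs/stag-mwc.yaml"
    shell:
        """
        bbcountunique.sh in={input.read1} out={output.histogram}
        """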

Add config option for groot report -c parameter

I'd like to add an option to modify the --covCutoff parameter to groot report. It should probably also be moved to a separate rule, so it can be rerun with different parameter settings without having to rerun the alignment step.
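
Something along these lines could work (the config key and file paths are assumptions):

rule groot_report:
    """Generate groot report with a configurable coverage cutoff."""
    input:
        bam=OUTDIR/"groot/{sample}.bam",
    output:
        report=OUTDIR/"groot/{sample}.groot_report.tsv",
    params:
        cov_cutoff=config["groot"]["covCutoff"],  # assumed config key
    shell:
        """
        groot report --bamFile {input.bam} -c {params.cov_cutoff} > {output.report}
        """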

Add taxonomy profiler: kraken2

A lot of people seem to like Kraken, so maybe we could consider adding that as an optional taxonomic profiling tool. The output from Kraken can also be used with Pavian, so presentation of the results becomes a breeze. Kraken is available to install via bioconda, so it should be easy to add to StaG-mwc.

The minimum database size to make use of Kraken is probably their MiniKraken (4GB) database, which is a 2.9 GB download.
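
For reference, a typical paired-end Kraken2 invocation looks roughly like this (database path, thread count, and file names are placeholders):

kraken2 --db /path/to/minikraken --threads 8 --paired --gzip-compressed --report sample.kreport --output sample.kraken sample_R1.fq.gz sample_R2.fq.gz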

Plot proportion human DNA

We should write a small script that parses the statsfiles from remove_human so we can plot the proportion of human DNA in each sample.

Create Krona output for Centrifuge

We should add automatic Krona plot generation for all the currently implemented taxonomic profilers. It was my plan from the beginning, but I forgot to add it when I added the taxonomic profilers. It shouldn't be too hard to add a rule for each profiler that gathers all output files and generates a combined Krona plot HTML document.
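
For Centrifuge, one common recipe (assuming the default per-read Centrifuge output, with read ID in column 1 and taxID in column 3, and that Krona's taxonomy database has been installed) is roughly:

cut -f 1,3 sample.centrifuge.tsv > sample.krona_input.tsv
ktImportTaxonomy -q 1 -t 2 -o sample.krona.html sample.krona_input.tsv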

Update to latest version of GROOT (v0.8.4)

The v0.8.3 version of GROOT that we're currently using contains a bug concerning plots of RESFINDER gene coverages, as some gene names in RESFINDER contain forward slashes. This has been fixed in the most recent version on conda (v0.8.4).

Stop conditionally including rules, instead conditionally include output files

This will enable users to request a single final stage by setting it to True in the config file, and Snakemake will still be able to find the chain of rule dependencies needed to produce the output files of the requested step.

The plan is simple:

  1. Stop conditionally including rules in the main Snakefile. Instead conditionally include output files into the all_files list in the rule files of each sub step of the workflow.
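
A minimal sketch of the idea (the config key and output path are placeholders; all_files is the list described above):

# In e.g. rules/taxonomic_profiling/kraken2.smk: always included, but only
# contributing outputs when the step is enabled in the config.
if config["taxonomic_profile"]["kraken2"]:
    all_files.extend(
        expand(str(OUTDIR/"kraken2/{sample}.kraken2.report"), sample=SAMPLES)
    )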

Filter reads on length before running groot

Groot requires all reads to be almost equally long. We should drop all reads not conforming to the length used in database construction (user-definable parameter) before running groot, otherwise it fails.
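
One way to do this (the exact thresholds depend on the read length used when the database was constructed; 97-100 bp is just an assumed example) is BBMap's reformat.sh:

reformat.sh in1=sample_R1.fq.gz in2=sample_R2.fq.gz out1=sample_R1.len_filtered.fq.gz out2=sample_R2.len_filtered.fq.gz minlength=97 maxlength=100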

Add possibility to run mapper several times using different databases and settings

Last night I was thinking about adding the ability to run any of the mappers (e.g. BBMap or Bowtie2) several times in a workflow, using different databases and different settings. I know from personal experience that I'd probably like to map to a database containing a specific set of genes, but also map to a larger gene catalog for overall functional content assessment. Rather than implementing distinct mapping steps for the different databases, it would be more flexible if we could configure mapping against several databases using the config file. I was thinking about putting the mapping settings in a list of databases to map against underneath the mapper configuration, e.g. in config.yaml:

#########################
# Mappers
#########################
bbmap:
    - db_name: "db_1"
      db_path: "/path/to/db1"
      min_id: 0.76
      extra: ""
      counts_table:
          annotations: ""
      featureCounts:
          annotations: "/path/to/db2.annotation.gtf"
          feature_type: "gene_id"
          attribute_type: ""
          extra: "" 
    - db_name: "db_2"
      db_path: "/path/to/db2"
      min_id: 0.98
      extra: ""
      counts_table:
          annotations: "/path/to/db2.annotation.tab"
      featureCounts:
          annotations: ""
          feature_type: ""
          attribute_type: ""
          extra: "" 

This would define two databases to map reads against using BBMap. Additional databases can be mapped by adding more items to the list. Then, at least the way I thought of this, we would let the code in rules/mappers/bbmap.smk programmatically generate mapping rules for each of the reference databases. For example, something like this:

for bbmap_config in config["bbmap"]:
    bbmap_output_folder = OUTDIR/"bbmap/{db_name}".format(db_name=bbmap_config["db_name"])
    bbmap_log_dir = LOGDIR/"bbmap/{db_name}".format(db_name=bbmap_config["db_name"])
    # Anonymous rule: Snakemake assigns its own name, see discussion below.
    rule:
        """BBMap against {db_name}""".format(db_name=bbmap_config["db_name"])
        input:
            read1=OUTDIR/"filtered_human/{sample}_R1.filtered_human.fq.gz",
            read2=OUTDIR/"filtered_human/{sample}_R2.filtered_human.fq.gz",
        output:
            sam=bbmap_output_folder/"{sample}.sam.gz",
            covstats=bbmap_output_folder/"{sample}.covstats.txt",
            rpkm=bbmap_output_folder/"{sample}.rpkm.txt",
        log:
            stdout=str(bbmap_log_dir/"{sample}.bbmap.stdout.log"),
            stderr=str(bbmap_log_dir/"{sample}.bbmap.statsfile.txt"),
        shadow: "shallow"
        conda: "../../envs/stag-mwc.yaml"
        message: "BBMap of {{sample}} against {db_name}".format(db_name=bbmap_config["db_name"])
        threads: 8
        params:
            db_path=bbmap_config["db_path"],
            min_id=bbmap_config["min_id"],
            extra=bbmap_config["extra"],
        shell:
            """
            ...
            """

It might not be the prettiest, but it should get the job done. One potential issue, however, is that it is currently not possible to programmatically define rule names in Snakemake. That issue is sidestepped by producing anonymous rules (i.e. rules without a name), for which Snakemake will come up with its own names, which might not be entirely descriptive... But that's a minor issue, in my opinion, at least if you consider that the rule description can still contain the important information. Also, the log message printed when the rule is executed could be modified to include the relevant information.

Add taxonomy profiler: centrifuge

I want to add Centrifuge as an alternative taxonomic profiling tool. The output from Centrifuge can be used with Pavian, which is a nice visualization tool for summarizing and starting to explore the results.

Centrifuge is also available via bioconda, so it is easy to make a separate conda environment for centrifuge-associated rules. In addition, there is a Docker image available as well, should we want to use Docker.

Change MetaPhlAn2 to use `-t rel_ab_w_read_stats`

It would be nice for #47 to have MetaPhlAn2 output estimated read counts, which can be accomplished using the command line argument -t rel_ab_w_read_stats. It will likely require some modifications to downstream code that merges or otherwise deals with reading the metaphlan2 tables (e.g. the alpha- and beta-diversity stuff from #47).
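
For reference, the invocation would look roughly like this (file names and thread count are placeholders):

metaphlan2.py sample_R1.fq.gz,sample_R2.fq.gz --input_type fastq --nproc 8 -t rel_ab_w_read_stats --bowtie2out sample.bowtie2.bz2 -o sample.metaphlan2.txt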

Implement remaining MEDUSA steps in StaG

Implement bowtie mapping, counting, combining, and annotating.

Medusa scripts:

  • streamAligner.py
  • combineCounts.py
  • annotateCounts.py

input needed:

  • filtered.humanremove.reads
  • database index
  • database annotation tsv file

Plan

  • Start with bowtie (we can perhaps implement more mappers later).
    • Paired or not paired, that is the question...
  • Provide the database index as input
  • One folder structure per database (name the folder and files according to what is set in the config file)
  • Start with the counting that exists in MEDUSA today (add other counting possibilities as options)
  • Statistics (maybe we can start with what is used in MEDUSA, but if you want to adjust, please adjust.)
  • Combine samples into one file (features as rows, samples as columns). I have written helper scripts for this step if you want to have a look.
  • Summarize statistics per sample.
  • Annotate. Collapse features into the same annotation given in the annotation file.
    • Input: columns to use for annotation (e.g. 1,4,5).
    • Output: 1 combined count file per annotation.

The file structure will be (roughly):

MEDUSA_<database name> (folder)
       Counts (folder)
             counts
             counts.log
       Combine (folder)
            combine.feature
            combine.anno1
            combine.anno4
            combine.anno5
       summary.stats.all.samples

Produce count files from bowtie2 output

@lis4matilda suggested we start off with the counting that was used in MEDUSA. The count files could e.g. be put in a separate Counts folder inside the bowtie2 mapping results folder, inside a folder named after the database used to map against:

output_dir/
    bowtie2/
        <db_name>/
            Counts/
                <sample>.counts.tab

Commit bdd980d added a rule to use BBMap's pileup.sh to produce two output files from the bowtie2 mapping output bam file: <sample>.covstats.txt and <sample>.rpkm.txt. The two files contain the following information:

covstats

The covstats output from BBMap (in this case pileup.sh) contains the following columns:

ID                  Name of reference sequence
Avg_fold            Average fold coverage across reference sequence
Length              Length of reference sequence
Ref_GC              GC content of reference sequence
Covered_percent     Percent of reference sequence covered by reads
Covered_bases       Number of bases of reference sequence covered by reads
Plus_reads          Number of reads mapping in plus orientation
Minus_reads         Number of reads mapping in minus orientation
Read_GC             GC content of mapped reads
Median_fold         Median fold coverage
Std_dev             Standard deviation

rpkm

The rpkm output from BBMap (in this case pileup.sh) contains the following information on the first four lines:

  • The name of the input file (i.e. the bowtie2 mapping output bam file)
  • The total number of reads
  • The number of mapped reads
  • The number of reference sequences in the database

It then contains the following columns:

Name      Name of reference sequence
Length    Length of reference sequence
Bases     Number of bases in reference sequence
Coverage  Coverage of reference sequence
Reads     Number of reads that mapped to reference sequence
RPKM      Reads per kilobase million 
Frags     Number of fragments mapped to reference sequence
FPKM      Fragments per kilobase million

I think this information is pretty much what you want @lis4matilda, but is there anything else output by MEDUSA that this doesn't cover?

It is fairly trivial to merge read counts per sample, or whatever else you want to combine, into the final combined output files.
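
For example, a small pandas sketch along these lines could do the merging (it assumes each per-sample count file has the feature ID in the first column and a single count column; paths are placeholders):

import pandas as pd
from pathlib import Path

# Combine per-sample count files into one features-by-samples table.
counts = {}
for tsv in Path("output_dir/bowtie2/db_name/Counts").glob("*.counts.tab"):
    sample = tsv.name.replace(".counts.tab", "")
    counts[sample] = pd.read_csv(tsv, sep="\t", index_col=0).iloc[:, 0]

combined = pd.DataFrame(counts).fillna(0)
combined.to_csv("output_dir/bowtie2/db_name/Counts/all_samples.counts.tab", sep="\t")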

conda: Placeholder of length '80' too short in package

There is a well-known issue with conda that occurs when installing older conda packages (built with conda-build < 2). The reason is that these packages come with binaries built with the assumption that paths are no longer than 80 characters. Conda packages built with a more recent version of conda-build increase this limit to 255, which should fix most issues.

However, for StaG, the workaround is to use --conda-prefix when running Snakemake to set as short a path as possible for the conda environments. In the worst case, maybe a symlink to a folder in /tmp (e.g. /tmp/symlink, which points to /my/long/path/somewhere/else) might work?

Edit: a link to what seems to be the most commonly cited issue about this: conda/conda#4088

Improve automated reports from StaG-mwc

Snakemake has built-in functionality to produce automated reports. I would like to figure out if we could use this to combine module-level summaries of the most important information into a nice overview report that can be used to assess the basic outcome of a StaG-mwc run.

Add mapper: bowtie2

Implement a stage in the workflow to do mapping with bowtie2 against any bowtie2-indexed database. Output mapping results in a folder structure showing what database was used, similar to how it was discussed in #1.

This is now implemented in the bowtie2 branch: https://github.com/boulund/stag-mwc/tree/bowtie2, waiting to be merged.

Add taxonomy profiler: MetaPhlAn2

Another common taxonomic profiling tool, MetaPhlAn2, could be useful to add to StaG-mwc. Of course, Pavian also has support for visualizing the outputs from MetaPhlAn2. MetaPhlAn2 is available via bioconda, making it fairly straightforward to add to the workflow using a specific conda environment. I think it is also possible to run via Docker (maybe this one?), should we want to do that.

I think that MetaPhlAn2 doesn't require the download of a large reference database if installed via bioconda, so maybe we should consider using this as the default taxonomic profiling tool...

Add antibiotic resistance gene detection

I want to add a step specifically to detect antibiotic resistance genes. My initial plan is to just use MEGARes as it's designed for use with shotgun data, and is fairly up-to-date. It is trivial to get coverage, aligned read counts, etc., when aligning reads using BBMap. The detection of AR genes is a fairly specific task that I know that I would like to routinely perform, hopefully warranting the creation of a specific workflow step for this rather than relying on a more general BBMap mapping step in StaG-mwc.
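
As a rough illustration (database path, identity threshold, and thread count are assumptions), the mapping step could boil down to something like:

bbmap.sh ref=/path/to/megares.fasta in1=sample_R1.fq.gz in2=sample_R2.fq.gz out=sample.sam.gz covstats=sample.covstats.txt rpkm=sample.rpkm.txt minid=0.9 threads=8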

Metagenome assembly

We need to figure out how we want to implement metagenome assembly. The discussion needs to be divided into several parts:

  1. What assembler(s) we want to use
  2. How to make the Snakemake technical implementation of the assembly step so that any assembler can be used to produce output files for steps downstream of the assembly
  3. What steps to add downstream of assembly

1. What assemblers to use

Several options are available to us. According to CAMI, MEGAHIT performed fairly well. It appears SPAdes was not included in the evaluation, but it seems like a popular assembler nonetheless. I suggest we start with adding these, unless someone has other preferences.
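
For reference, typical invocations of the two assemblers look roughly like this (paths and thread counts are placeholders):

megahit -1 sample_R1.fq.gz -2 sample_R2.fq.gz -t 16 -o output_dir/megahit/sample1
spades.py --meta -1 sample_R1.fq.gz -2 sample_R2.fq.gz -t 16 -o output_dir/spades/sample1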

2. How to implement it

Perhaps it is easiest if we let each assembler output into its own folder, but move the assembled contigs into a folder specifically for assembled contigs, for example (the actual outputs from the two assemblers in the example look different in real life):

output_dir/spades
├── sample1
│   ├── contigs.fa
│   ├── file1.txt
│   └── file2.fq
└── sample2
    ├── contigs.fa
    ├── file1.txt
    └── file2.fq
output_dir/megahit
├── sample1
│   ├── contigs.fa
│   ├── file1.txt
│   └── file2.fq
└── sample2
    ├── contigs.fa
    ├── file1.txt
    └── file2.fq
output_dir/assembled_contigs
├── sample1.contigs.megahit.fa
├── sample1.contigs.spades.fa
├── sample2.contigs.megahit.fa
└── sample2.contigs.spades.fa

A structure like this should make it fairly straightforward for us to design downstream steps that just process the contig files found in output_dir/assembled_contigs/, making it flexible enough so that users can pick whatever assembler they want, and also make comparisons between assemblers if that is interesting. It has the potential drawback of having to run all downstream steps that depend on assembly for the output from all assemblers used, but I guess most users will just pick one assembler and stick to that.

3. What steps to add downstream of assembly

Typical steps we could put here are binning steps, to bin the assembled contigs into "metagenome-assembled genomes" (MAGs)/"metagenomic species" (MGS), which can then be further analyzed. Two popular alternatives here could be MaxBin or CONCOCT, with a follow-up check of the bins using CheckM. Potentially, the CheckM-related genome binner GroopM could be used if samples are taken from "at least 3 timepoints" (according to its documentation), but that might make it slightly difficult to implement in StaG-mwc as a general tool, because the assumptions that GroopM makes probably fail for most of our study designs.

I know there was some talk about co-abundance clustering to produce co-abundance gene groups (CAGs), exactly how were you thinking about that @lis4matilda?

Issues with naming of Slurm jobs and logfiles

Currently, there is an issue with the naming of Slurm jobs and logfiles. Not all steps of StaG-mwc use the {sample} wildcard in their input declarations (typically jobs that summarize data from all samples), and thus they cannot be run using the naming scheme for Slurm log files that is currently implemented in the Rackham profile: https://github.com/boulund/stag-mwc/blob/master/cluster_configs/rackham/rackham.yaml#L8-L10

I would like to put all Slurm log files in a folder in the output directory, but I haven't found a good way of doing this yet. For example, I'd like the slurm logs to go into files like OUTDIR/logs/slurm/slurm-{rule}-{wildcards.sample}-{jobid}, but that's not possible due to several reasons:

  1. {wildcards.sample} might not exist (as I mentioned in the introduction above)
  2. {jobid} would be the Snakemake-assigned jobid of the job (as we cannot know the jobid of the Slurm job prior to submission), but that is not accessible in the config file when setting the logfile name.
  3. It is currently not possible to access the OUTDIR variable here, nor is something like config["logdir"] available.

I think we'll have to go back to the more basic default Slurm logfile naming that Snakemake uses at the moment, and live with the workdir getting flooded with Slurm logs, at least until a better solution appears.

For reference, I recently created an issue about the Slurm logfile output directory problem on the official Snakemake repo on Bitbucket: https://bitbucket.org/snakemake/snakemake/issues/838/how-to-create-output-folders-for-slurm-log

Add/improve annotation after mapping from tab-delim file

Add, as an option for count_table under the mappers rules, the number of columns to produce annotations for (a rough sketch follows after the list below).

  • Takes as input: the count per feature from the mapping, and the annotation file given under count_table: annotations.
  • The annotation file is a tab-delimited file with all features in the db in the first column.
  • All counts for features with the same annotation are summed up and written to that annotation in the annotated count file.
  • One count file is produced for each column specified. If 3 columns are specified, 3 count files will be produced.
  • The number of rows in a count file is the number of unique annotations in the specific column.
  • The number of columns in an annotated count file is the same as the number of samples.
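
A rough pandas sketch of the collapse-by-annotation step (file names and the example column name are hypothetical):

import pandas as pd

counts = pd.read_csv("combined.counts.tab", sep="\t", index_col=0)      # features x samples
annotations = pd.read_csv("db.annotation.tab", sep="\t", index_col=0)   # features in first column

column = "anno4"  # one annotated count file is produced per requested annotation column
collapsed = counts.join(annotations[column]).groupby(column).sum()
collapsed.to_csv("combined.{}.tab".format(column), sep="\t")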

Check for length of SAMPLES

I just encountered this issue myself. If the user enters a path or filename pattern for samples that leads to no samples being found, StaG doesn't report an error. We should just add something like

from snakemake.exceptions import WorkflowError

if len(SAMPLES) < 1:
    raise WorkflowError("Found no samples! Check the input path/filename pattern in the config file.")

Cluster sketch heatmap

I was thinking it could be nice to produce both the standard sketch comparison heatmap along with a clustered version of the same data, to see if there are samples that display strong clustering already in the sketches.

DB download and indexing rules not working

I have accidentally messed up the helpful rules for automated database download and indexing for all tools. The check for the existence of a database in the locations specified in the config file occurs and raises a workflow error before the rule to download/index the database can be executed. I have to rethink how this is supposed to work...

Add mapper: BBMap

I want to add a step for mapping reads against a database using BBMap. BBMap is already used in the read preprocessing steps, so adding a general read mapping step similar to the bowtie2 one should be fairly straightforward.

BBMap can produce output in most formats and can readily produce coverage, rpkm, etc in a single step directly when mapping.

Add MinHash comparisons of all samples

I would love to get MinHash comparisons of all samples being analyzed. I've previously used both Mash and Sourmash for this, and I've recently started looking into the possibility of using sketch.sh/comparesketch.sh from the BBMap suite of tools. I haven't assessed the quality/usability of the output that the BBTools produce, but I'm thinking they're probably good enough. The benefit of using e.g. Sourmash is that that package already has code to produce some decent looking plots of the results, with a heatmap and dendrograms, something we would have to code ourselves if going with BBMap's tools.

Add changelog

This project is starting to accrue improvements and changes that make sense to document in a changelog. In the link I've provided, there are some general concepts regarding changelogs that I think make a lot of sense. I will soon make a PR with a draft for the first changelog.

Add MultiQC

I suggest we add a MultiQC step to produce a summary report after the workflow completes. It will summarize FastQC output, and some of BBMap's output files as well.
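
Running it is essentially a one-liner after the other steps have finished (output path is a placeholder):

multiqc output_dir/ --outdir output_dir/multiqc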

Update docs to ctmrbio/stag-mwc

After moving the repo to ctmrbio/stag-mwc from boulund/stag-mwc, we need to update the docs to reflect the new repo location.

Add SUPER-FOCUS

I've started working on adding SUPER-FOCUS now that it's installable via conda. We will probably not implement automatic database downloading for SUPER-FOCUS to begin with, as I think the database download utility supplied with SUPER-FOCUS requires some more work before it is easy to use in StaG.

Compute Alpha and Beta diversity

We should consider adding a rule or two to automatically compute alpha diversity and maybe do some simple beta-diversity plots on all input samples, as that is often requested by people.

I'm thinking the best way to implement this is to use scikit-bio and write a few custom Python scripts.
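
A minimal sketch using scikit-bio (assuming a combined feature table with samples as rows and taxa/features as columns; file names are placeholders):

import pandas as pd
from skbio.diversity import alpha_diversity, beta_diversity

# One row per sample, one column per taxon/feature.
table = pd.read_csv("combined_counts.tsv", sep="\t", index_col=0)

shannon = alpha_diversity("shannon", table.values, ids=table.index)
bray_curtis = beta_diversity("braycurtis", table.values, ids=table.index)

shannon.to_csv("alpha_shannon.tsv", sep="\t")
bray_curtis.to_data_frame().to_csv("beta_braycurtis.tsv", sep="\t")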
