snakemake-workflows / docs

Documentation of the Snakemake-Workflows project

License: MIT License

snakemake snakemake-workflows

docs's Issues

Additional python module requirements

Hey @johanneskoester, another quick question! For a local run (without Singularity or Docker), is the user responsible for ensuring that all dependencies (snakemake, pandas if used, and others) are installed? If so, I'll add a requirements.txt, with usage instructions, to the workflow, unless there is a specific place where we should declare this to ensure the install.

How to import common scripts in scripts

I want to use the script directive, and in my (Python) script I want to access other scripts in the directory scripts/common.

How can I make the import?
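
One common approach is to put scripts/common on sys.path before importing. The following is a minimal sketch, not a confirmed Snakemake recipe: the helper name add_common_dir and the module name helpers are hypothetical, and whether __file__ points at the original script location inside a script: job is an assumption to verify for your Snakemake version.

```python
import sys
from pathlib import Path


def add_common_dir(script_file, subdir="common"):
    """Prepend <directory of script_file>/<subdir> to sys.path and return it."""
    common = Path(script_file).resolve().parent / subdir
    if str(common) not in sys.path:
        sys.path.insert(0, str(common))
    return common


# Usage inside scripts/my_script.py (hypothetical file names):
#   add_common_dir(__file__)
#   import helpers  # would resolve to scripts/common/helpers.py
```

Resolving the path relative to the script itself, rather than to the working directory, keeps the import independent of where snakemake is invoked.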

Tutorial does not create bam indices

I deleted the two results folders and re-ran the full script, but the BAM files do not get indexed.
What am I missing?
Thanks

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"

rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam"
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}"

rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    shell:
        "samtools index {input}"

> snakemake sorted_reads/{A,B}.bam

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	2	samtools_sort
	2

[Mon Oct 22 11:40:17 2018]
rule samtools_sort:
    input: mapped_reads/B.bam
    output: sorted_reads/B.bam
    jobid: 0
    wildcards: sample=B

[Mon Oct 22 11:40:17 2018]
Finished job 0.
1 of 2 steps (50%) done

[Mon Oct 22 11:40:17 2018]
rule samtools_sort:
    input: mapped_reads/A.bam
    output: sorted_reads/A.bam
    jobid: 1
    wildcards: sample=A

[Mon Oct 22 11:40:17 2018]
Finished job 1.
2 of 2 steps (100%) done
Complete log: /data/NC_projects/snakemake-tutorial/.snakemake/log/2018-10-22T114017.262317.snakemake.log

# is leading to 
> ls -lah sorted_reads/
total 4.4M
drwxr-xr-x 2 u0002316 domain users 4.0K Oct 22 11:30 .
drwxr-xr-x 6 u0002316 domain users 4.0K Oct 22 11:39 ..
-rw-r--r-- 1 u0002316 domain users 2.2M Oct 22 11:30 A.bam
-rw-r--r-- 1 u0002316 domain users 2.2M Oct 22 11:30 B.bam
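
The log above schedules only samtools_sort jobs, which matches the requested targets: Snakemake runs only the jobs needed to create the files it is asked for, and sorted_reads/{A,B}.bam does not depend on the .bai files. A minimal sketch, assuming the sample names A and B from the log, that builds a command line requesting the index files instead:

```python
import shlex

SAMPLES = ["A", "B"]  # sample names taken from the log above


def index_targets(samples):
    """Return the .bam.bai paths that the samtools_index rule declares as output."""
    return [f"sorted_reads/{sample}.bam.bai" for sample in samples]


def snakemake_command(targets):
    """Build a snakemake invocation that requests the given target files."""
    return shlex.join(["snakemake", *targets])


print(snakemake_command(index_targets(SAMPLES)))
# prints: snakemake sorted_reads/A.bam.bai sorted_reads/B.bam.bai
```

Requesting the .bam.bai files pulls in samtools_index and, through its input, samtools_sort as well.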

source of test data

Hi,

I am wondering where to find the raw datasets on which the test samples are based.
Is this information (metadata) available somewhere?
It is a general question, not specific to any Snakemake workflow.

Thanks !

single-end, paired-end, singletons and merged

Hey, I'm currently developing the metagenomic pipeline ATLAS. I'm trying to be more compliant with your guidelines, and maybe pack some parts into wrappers. One problem that I'm facing is that you don't know whether the user is using single-end or paired-end reads.
In addition, through quality filtering you might end up with reads that lost their mate (singletons). If you don't want to lose them, you will have three files from the initial two files. If you merge paired-end reads, you also end up with additional files of reads, which don't have the same length distribution.
If you want to keep them separate, you might end up with 4 files for the same reads.

It seems that most wrappers are made neither to handle this varying number of read files nor to distinguish between them. Any idea how to solve this issue?

In the ATLAS pipeline, we solved the issue by checking at the beginning whether the sample is single-end or paired-end and then using input functions.
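
The input-function approach described above can be sketched as follows. This is an illustration only: the sample table, the file names, and the names is_paired and get_reads are hypothetical and not taken from ATLAS.

```python
from types import SimpleNamespace

# Hypothetical sample table: one file means single-end, two mean paired-end.
SAMPLE_FASTQS = {
    "sampleA": ["reads/sampleA_R1.fastq.gz", "reads/sampleA_R2.fastq.gz"],
    "sampleB": ["reads/sampleB.fastq.gz"],
}


def is_paired(sample):
    """Decide from the sample table whether a sample is paired-end."""
    n_files = len(SAMPLE_FASTQS[sample])
    if n_files not in (1, 2):
        raise ValueError(f"{sample}: expected 1 or 2 FASTQ files, got {n_files}")
    return n_files == 2


def get_reads(wildcards):
    """Input function: return named inputs for the sample in `wildcards`."""
    files = SAMPLE_FASTQS[wildcards.sample]
    if is_paired(wildcards.sample):
        return {"r1": files[0], "r2": files[1]}
    return {"se": files[0]}


# SimpleNamespace stands in for the wildcards object Snakemake passes in.
print(get_reads(SimpleNamespace(sample="sampleB")))
```

In a rule, a function that returns a dict like this would be used as input: unpack(get_reads), so that the rule sees the named inputs that exist for that sample.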

Question: best practice for providing data in the repository?

I'm putting together my first dummy pipeline, and I have toy data that is small enough to store alongside the workflow. Where is the "best practices" spot to put it? I've been looking around the examples but it seems like most don't provide data, or download from a remote. Thank you!

pypi upload of workflows

Hello fellow snakemakers!

I was thinking about how to make installing and using our workflow even easier, and thought of using PyPI as another repository. This would mean that the workflow would also be a Python package.

This comes with a few neat things. One of them is that you can run the workflow from anywhere as an executable, passing all arguments to snakemake as usual.

Instead of having to be in the workflow folder and run snakemake --directory WORKINGDIR, you can run workflow_name --directory WORKINGDIR from anywhere, and it will do the same thing.

Here is the code of the main.py file:

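import os
import subprocess
import sys


def main():
    # Forward all command-line arguments to snakemake unchanged.
    arguments = ['snakemake']
    arguments.extend(sys.argv[1:])
    # Run snakemake from the directory one level above this file.
    workflow_dir = os.path.join(os.path.dirname(__file__), '..')
    result = subprocess.run(args=arguments, cwd=workflow_dir)
    # Pass snakemake's exit status on to the calling shell.
    sys.exit(result.returncode)


if __name__ == '__main__':
    main()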

Another thing that I wanted to implement is a "prepare" functionality which would be an interactive question/answer in the prompt to generate samples.csv and configuration files. In our workflow, I would, for example, ask "which chemistry have you used for your experiment" giving a list of available choices directly in the prompt. This would then write the correct chemistry directly into the config.yaml instead of relying on people to write it themselves. This would ensure that there are no spelling mistakes and check for int types for example.
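
A minimal sketch of such a "prepare" step, under stated assumptions: the function names, the config keys, and the chemistry names in the usage comment are hypothetical, and the config is written as flat key/value lines rather than through a YAML library.

```python
def ask_choice(question, choices, input_fn=input):
    """Ask until the answer is a listed choice (by number or name); return it."""
    numbered = "\n".join(f"  {i}) {c}" for i, c in enumerate(choices, start=1))
    while True:
        answer = input_fn(f"{question}\n{numbered}\n> ").strip()
        if answer.isdecimal() and 1 <= int(answer) <= len(choices):
            return choices[int(answer) - 1]
        if answer in choices:
            return answer
        print(f"Please pick one of: {', '.join(choices)}")


def ask_int(question, input_fn=input):
    """Ask until the answer parses as an integer; return it as int."""
    while True:
        answer = input_fn(f"{question}\n> ").strip()
        try:
            return int(answer)
        except ValueError:
            print("Please enter a whole number.")


def write_config(path, settings):
    """Write flat key/value settings as simple 'key: value' lines."""
    with open(path, "w") as handle:
        for key, value in settings.items():
            handle.write(f"{key}: {value}\n")


# Usage (hypothetical keys and choices):
#   chemistry = ask_choice("Which chemistry did you use?", ["chemistry_v1", "chemistry_v2"])
#   cells = ask_int("How many cells do you expect?")
#   write_config("config.yaml", {"chemistry": chemistry, "expected_cells": cells})
```

Passing input_fn as a parameter keeps the prompts testable without a terminal.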

This preparation step always felt "odd" in a workflow, because normally you rely on having everything set up before running it. Although the new checkpoints might fit the bill?

One potential issue with the executable approach is that it might confuse users if both the executable and the classic workflow are available, and that the workflow would live in Python's lib directory rather than being cloned somewhere.

Since this is a big/strange shift I wanted to have your opinion before moving forward with this experiment. Let me know if I'm missing any crucial problems.

Best wishes

How best to modularize snakemake workflows

I'm developing metagenome atlas, a Snakemake pipeline for metagenomics which takes you through all the steps: QC, assembly, binning, genome prediction, and annotation.
I made a click wrapper to get users started in three commands.

Now I'm planning how to maintain this pipeline, and I think I should modularize it so that parts can be updated and tested separately.

I read the Snakemake docs about modularization.

The first step is definitely to make wrappers.
I thought I should also create sub-workflows (e.g. QC, assembly) and contribute them to the snakemake-workflows project.

Now my question is: how do I best do this so that the sub-workflows work as stand-alone workflows but also as part of metagenome atlas?

Accel Amplicon trimming workflow

I have created a repository with a Snakemake workflow for trimming data generated with the Accel Amplicon Panel, following the guidelines recommended by Swift Biosciences.

https://github.com/clinical-genomics-uppsala/accel_amplicon_trimming

Best practice for cross-workflow specification of files

I would like to discuss the best way to specify files so that they can be used across workflows.

Take the example of two workflows, e.g.:

Workflow 1: reads --> assembly

Workflow 2: assembly + reads --> assembly statistics ...

What is the best way to specify the reads and the assembly so that they can be used by different workflows?
Take into account that:
Requirement A: The reads might be used in multiple places in Workflow 2.
Requirement B: The reads will probably be used to infer the total number of samples in the target rule.

With sub-workflows, it would be possible to refer to a file as otherworkflow(file).

But I think the recommended way now is to use modules and to import the rules of Workflows 1 and 2 into a new workflow.
But then I need to know which rules I have to modify to adapt the file specification. This would necessarily have to be documented in the README of a workflow.

I don't see how this can be done without massively modifying many rules of an imported workflow.

Any thoughts?
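
One possible direction, sketched here only as an assumption and not as a recommended Snakemake practice, is to keep the file patterns in a small shared Python module that both workflows import, so the specification lives in one place. The pattern strings and the names PATTERNS, path_for, and infer_samples are hypothetical.

```python
import re

# Hypothetical shared file specification, importable by both workflows.
PATTERNS = {
    "reads": "reads/{sample}_R{mate}.fastq.gz",
    "assembly": "assembly/{sample}/contigs.fasta",
}

# Regular expression matching the hypothetical "reads" pattern above.
READS_REGEX = re.compile(r"reads/(?P<sample>.+)_R[12]\.fastq\.gz$")


def path_for(kind, **wildcards):
    """Fill in a shared pattern, usable wherever the file is needed (Requirement A)."""
    return PATTERNS[kind].format(**wildcards)


def infer_samples(read_files):
    """Derive the sorted sample list from the read files (Requirement B)."""
    samples = set()
    for name in read_files:
        match = READS_REGEX.search(name)
        if match:
            samples.add(match.group("sample"))
    return sorted(samples)


reads = ["reads/s1_R1.fastq.gz", "reads/s1_R2.fastq.gz", "reads/s2_R1.fastq.gz"]
print(infer_samples(reads))
print([path_for("assembly", sample=s) for s in infer_samples(reads)])
```

This does not remove the need to adapt imported rules, but it narrows the change to replacing hard-coded paths with lookups in one table.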

dropSeqPipe - Single cell data preprocessing snakemake workflow

Hello,

The latest version of my pipeline is an attempt to structure it as a Snakemake workflow.
I'm kindly asking for a review.

I have not yet worked on specific envs for each rule but this can be done in the future without too much effort.

Please tell me if there is anything else that I need to implement to pass the review.

Best wishes

I am working on a ChIP-Seq pipeline in Snakemake

We are hoping to publish it as an applications note. Work in progress here:

https://github.com/biocore-ntnu/chip_seq_pipeline

Just a heads-up.

It is massive so reviewing it will probably be hard :/ Any feedback appreciated though :) Going to focus hard on finishing the docs and getting a beta out.

Edit: I see that I do not have an integration test, just plenty of dryrun tests testing the DAG logic. I started this before learning about this repo, so I might not be following best practices :/

Question: Don't delete temporary wrapper scripts

I'm trying to get Snakemake working with Singularity, and I need to debug the Singularity command, but the script wrapper that does the execution doesn't exist after the failure, e.g., I need to test:

 singularity exec --home /home/vanessa/Documents/Dropbox/Code/labs/cherry/snakemake/encode-demo-workflow  --bind /home/vanessa/anaconda3/lib/python3.7/site-packages:/mnt/snakemake /home/vanessa/Documents/Dropbox/Code/labs/cherry/snakemake/encode-demo-workflow/.snakemake/singularity/328e6123b3d8f239ce917fa97ccbbd80.simg bash -c 'set -euo pipefail;  python /home/vanessa/Documents/Dropbox/Code/labs/cherry/snakemake/encode-demo-workflow/.snakemake/scripts/tmp2navashs.wrapper.py'

And why is home being force bound to be the present working directory? That seems strange for Singularity. Should it be --pwd?

installation of snakemake tutorial

Hi,

After running the command
conda env create --name snakemake-tutorial --file environment.yaml

I end up with the error:

SpecNotFound: Invalid name, try the format: user/package

Do you have any suggestions?
Thank you!

General

This is a place for general, project-related discussions.

Unit patterns file and what if you don’t have units

A couple of questions: could you explain the first column of the unit patterns TSV file? The doc example looks sort of like a regex, but the <nr> part is not.

Also, what if you don’t have unit components in your FASTQ file names? It’s quite common to have only a one-to-one mapping between FASTQ files and samples, without any lane info. Is the workflow structure designed to also work with sample names alone, without having to invent units/lanes in the unit patterns file and manually rename FASTQ files to include these IDs?

Workflow auto-download (specified in config.yaml)

snakemake workflows is a great idea! Thanks for putting it together.

API Suggestion

What do you think about having a command line tool for running any snakemake workflow with the following user API

sworkflow config.yaml [other snakemake flags]

config.yaml would be the standard config.yaml with an additional workflow entry specifying which workflow to use (as a string):

workflow: https://github.com/snakemake-workflows/single-cell-rna-seq/tree/a1be3b6b389b009d91bb1d7f75abc1b5a23cd19d

# The usual config.yaml ----------------------------------
# path to sheet describing each cell.
cells: cells.tsv

# specify count table (rows: genes/transcripts/spikes, cols: cells)
counts: counts.tsv
...

This functionality is conceptually similar to snakemake rule wrappers, where you refer to a command with a single string.

Implementation

The sworkflow command would do the following:

Git checkout the workflow source code to a common location (~/.snakemake/workflows/<myworkflow>?)
Run the snakemake command: snakemake --snakefile ~/.snakemake/workflows/<myworkflow>/Snakefile [other snakemake flags]
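
The two steps above can be sketched as a function that only plans the commands without running them. This is an illustration of the proposed, not yet existing, sworkflow tool: the function name plan_commands is hypothetical, the repository name stands in for <myworkflow>, and passing the config via --configfile is an assumption added here.

```python
import os
import re

# Matches URLs of the form https://github.com/<org>/<repo>/tree/<revision>
WORKFLOW_URL = re.compile(
    r"^https://github\.com/(?P<org>[^/]+)/(?P<repo>[^/]+)/tree/(?P<rev>[^/]+)/?$"
)


def plan_commands(workflow_url, config_path, extra_flags=(),
                  base_dir="~/.snakemake/workflows"):
    """Return the git and snakemake commands for one sworkflow invocation."""
    match = WORKFLOW_URL.match(workflow_url)
    if match is None:
        raise ValueError(f"Unsupported workflow URL: {workflow_url}")
    repo_url = f"https://github.com/{match['org']}/{match['repo']}.git"
    checkout = os.path.join(os.path.expanduser(base_dir), match["repo"])
    return [
        # Step 1: fetch the source and pin it to the requested revision.
        ["git", "clone", repo_url, checkout],
        ["git", "-C", checkout, "checkout", match["rev"]],
        # Step 2: run snakemake against the checked-out Snakefile.
        ["snakemake", "--snakefile", os.path.join(checkout, "Snakefile"),
         "--configfile", config_path, *extra_flags],
    ]
```

Each returned list could be handed to subprocess.run; a real tool would also need to skip the clone when the checkout already exists.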

Motivation

This command would come in handy when you want to apply a single workflow multiple times (say, you are analyzing different but related datasets). Currently, you'd need to check out the source code into each directory.

Reviews

Please post below to request a review.

TODO

some example workflow

Snakemake Missing Input Files

Hi There,

I've followed all of the install instructions, and now I seem to be running into permission issues. My YAML files all seem to be working correctly now, and a pool is created successfully.

But some of the rules at the beginning of the Snakefile seem to cause permission issues when creating directories.

Provided is the error I'm getting:

I've attached my Snakefile and YAML files for the batch-shipyard configuration. Some help would be really appreciated! I've followed the installation guide very carefully, so I think it's something specific to Snakemake. What's odd is that it works without problems when installing Snakemake and the dependencies with Conda.

# Snakefile for the RNA-Seq analysis pipeline using test data from zebrafish
# You should not need to edit this file unless you are changing the programs in the pipeline
configfile: "config_zebrafish.yaml"
SAMPLES = config['samples']

R1_suffix = config['input_file_R1_suffix']
R2_suffix = config['input_file_R2_suffix']
genome_fasta_file = config['genome_fasta_file']
genome_index_base = config['genome_index_base']
merged_transcripts_file = config['merged_transcripts_file']

rule trim_and_qc_all:
    input:
        html=expand("{sample}_R1.trimmed_paired_fastqc.html", sample=SAMPLES)

rule trim_reads:
    input:
        R1_reads="data/{sample}" + R1_suffix,
        R2_reads="data/{sample}" + R2_suffix
    output:
        "1_trimmed_reads/{sample}_R1.trimmed_paired.fastq",
        "1_trimmed_reads/{sample}_R1.trimmed_unpaired.fastq",
        "1_trimmed_reads/{sample}_R2.trimmed_paired.fastq",
        "1_trimmed_reads/{sample}_R2.trimmed_unpaired.fastq"
    threads: config['threads']
    params:
        run_params=config['trimmomatic_params']
    shell:
        "echo -e \"#!/usr/bin/env bash\ncd $FILESHARE;\n trimmomatic PE -threads {threads} {input.R1_reads} {input.R2_reads} {output}\" > $FILESHARE/jobrun.sh ;\n $SHIPYARD/shipyard jobs add --configdir $FILESHARE/azurebatch --tail stderr.txt\n"