snakemake-workflows / docs
Documentation of the Snakemake-Workflows project
License: MIT License
Hey @johanneskoester, another quick question! In the case of a local run (without Singularity or Docker), is the user in charge of ensuring that all dependencies (snakemake, pandas if used, and others) are installed? If so, I'll add a requirements.txt to the workflow with instructions on how to use it, unless there is a specific place where we should add this to ensure the install?
I want to use the script directive, and in my (Python) script I want to access other scripts in the directory scripts/common.
How can I do the import?
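One possible approach, sketched here under assumptions rather than as the recommended Snakemake way: put the scripts/common directory on sys.path before importing from it. The helper name add_common_to_path is hypothetical. Inside a script run through the script directive, the script's own directory has to be supplied by the caller; the injected snakemake object is assumed to expose it as snakemake.scriptdir, which should be checked against the Snakemake version in use.

```python
import sys
from pathlib import Path


def add_common_to_path(script_dir, subdir="common"):
    """Put <script_dir>/<subdir> at the front of sys.path and return it."""
    common_dir = str(Path(script_dir).resolve() / subdir)
    if common_dir not in sys.path:
        sys.path.insert(0, common_dir)
    return common_dir


# Hypothetical usage inside scripts/my_script.py (assumption, see above):
#   add_common_to_path(snakemake.scriptdir)
#   import helpers  # would load scripts/common/helpers.py
```

Modifying sys.path keeps the shared code importable without packaging it, at the cost of possible name clashes with installed modules.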
I deleted the two results folders and re-ran the full script, but the BAMs do not get indexed.
What am I missing?
Thanks
rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"

rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam"
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}"

rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    shell:
        "samtools index {input}"
> snakemake sorted_reads/{A,B}.bam
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
2 samtools_sort
2
[Mon Oct 22 11:40:17 2018]
rule samtools_sort:
input: mapped_reads/B.bam
output: sorted_reads/B.bam
jobid: 0
wildcards: sample=B
[Mon Oct 22 11:40:17 2018]
Finished job 0.
1 of 2 steps (50%) done
[Mon Oct 22 11:40:17 2018]
rule samtools_sort:
input: mapped_reads/A.bam
output: sorted_reads/A.bam
jobid: 1
wildcards: sample=A
[Mon Oct 22 11:40:17 2018]
Finished job 1.
2 of 2 steps (100%) done
Complete log: /data/NC_projects/snakemake-tutorial/.snakemake/log/2018-10-22T114017.262317.snakemake.log
# is leading to
> ls -lah sorted_reads/
total 4.4M
drwxr-xr-x 2 u0002316 domain users 4.0K Oct 22 11:30 .
drwxr-xr-x 6 u0002316 domain users 4.0K Oct 22 11:39 ..
-rw-r--r-- 1 u0002316 domain users 2.2M Oct 22 11:30 A.bam
-rw-r--r-- 1 u0002316 domain users 2.2M Oct 22 11:30 B.bam
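A possible explanation, offered as a sketch rather than a confirmed diagnosis: Snakemake works backwards from the requested targets, and the command above requests only the sorted BAM files, so nothing asks for the .bam.bai outputs. The toy resolver below mimics that backward chaining for the three rules. The names RULES, match_output and rules_needed are hypothetical and not Snakemake API, and existing files and timestamps are ignored.

```python
import re

# Rule name -> (input patterns, output pattern), copied from the rules above.
RULES = {
    "bwa_map": (["data/genome.fa", "data/samples/{sample}.fastq"],
                "mapped_reads/{sample}.bam"),
    "samtools_sort": (["mapped_reads/{sample}.bam"],
                      "sorted_reads/{sample}.bam"),
    "samtools_index": (["sorted_reads/{sample}.bam"],
                       "sorted_reads/{sample}.bam.bai"),
}


def match_output(pattern, target):
    """Return the {sample} value if target matches the pattern, else None."""
    parts = [re.escape(part) for part in pattern.split("{sample}")]
    match = re.fullmatch("(?P<sample>.+)".join(parts), target)
    return match.group("sample") if match else None


def rules_needed(target):
    """List the rules required to build target, dependencies first."""
    needed = []
    for name, (inputs, output) in RULES.items():
        sample = match_output(output, target)
        if sample is None:
            continue  # this rule cannot produce the target
        for pattern in inputs:
            needed.extend(rules_needed(pattern.format(sample=sample)))
        needed.append(name)
    return needed


print(rules_needed("sorted_reads/A.bam"))
print(rules_needed("sorted_reads/A.bam.bai"))
```

Under this reading, requesting the index files directly, for example snakemake sorted_reads/{A,B}.bam.bai, would pull in samtools_index.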
Hi,
I am wondering where to find the raw datasets that the test samples are based on.
Is the information (metadata) available somewhere?
It is a general question, not specific to any snakemake workflow.
Thanks !
Hey, I'm currently developing the metagenomic pipeline ATLAS. I'm trying to be more compliant with your guidelines, and maybe pack some parts into wrappers. One problem I'm facing is that a wrapper doesn't know whether the user has single-end or paired-end reads.
In addition, through quality filtering you might end up with reads that lost their mate (singletons). If you don't want to lose them, you will have three files from the initial two. If you merge paired-end reads, you also end up with additional read files that don't have the same length distribution.
If you want to keep them separate, you might end up with 4 files for the same reads.
It seems that most wrappers are not made to handle this varying number of read files, nor to distinguish between them. Any idea how to solve this issue?
In the ATLAS pipeline, we solved the issue by checking at the beginning whether the sample is single-end or paired-end and then using input functions.
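The input-function idea can be sketched as follows. This is an illustration under assumptions, not the ATLAS code: the SAMPLES table, the file paths and the function names are all hypothetical, and a wildcards object is simulated with SimpleNamespace so the functions can be tried outside Snakemake.

```python
from types import SimpleNamespace

# Hypothetical sample table: sample name -> raw FASTQ files
# (one file means single-end, two files mean paired-end).
SAMPLES = {
    "sampleA": ["reads/sampleA_R1.fastq.gz", "reads/sampleA_R2.fastq.gz"],
    "sampleB": ["reads/sampleB.fastq.gz"],
}


def is_paired(sample):
    """True if the sample has two raw read files."""
    return len(SAMPLES[sample]) == 2


def get_raw_reads(wildcards):
    """Input function: one file for single-end, two for paired-end."""
    return list(SAMPLES[wildcards.sample])


def get_qc_reads(sample):
    """Files expected after quality filtering, singletons kept separately."""
    fractions = ["R1", "R2", "se"] if is_paired(sample) else ["se"]
    return ["qc/{}_{}.fastq.gz".format(sample, f) for f in fractions]


# Simulated wildcards, as Snakemake would pass them to an input function.
print(get_raw_reads(SimpleNamespace(sample="sampleB")))
print(get_qc_reads("sampleA"))
```

In a rule, such a function would be passed as input: get_raw_reads, so the number of input files follows the sample table instead of being fixed in the rule.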
I'm putting together my first dummy pipeline, and I have toy data that is small enough to store alongside the workflow. Where is the "best practices" spot to put it? I've been looking around the examples but it seems like most don't provide data, or download from a remote. Thank you!
Hello fellow snakemakers!
I was thinking about how I could make the installation and use of our workflow even easier, and thought of using PyPI as another repo. This would mean that the workflow would now also be a Python package.
It comes with a few neat things, one of them is that you can run a workflow from anywhere as an executable passing all arguments to snakemake as usual.
Instead of having to be in the workflow folder and run snakemake --directory WORKINGDIR
you can run workflow_name --directory WORKINGDIR
from anywhere and it will run the same thing.
Here is the code of the main.py file
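import os
import subprocess
import sys


def main():
    # Forward all command-line arguments to snakemake unchanged.
    arguments = ['snakemake']
    arguments.extend(sys.argv[1:])
    # Run from the directory above this file (assumed to be the workflow folder).
    workflow_dir = os.path.join(os.path.dirname(__file__), '..')
    result = subprocess.run(args=arguments, cwd=workflow_dir)
    # Propagate snakemake's exit status to the caller.
    sys.exit(result.returncode)


if __name__ == '__main__':
    main()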
Another thing that I wanted to implement is a "prepare" functionality which would be an interactive question/answer in the prompt to generate samples.csv
and configuration files. In our workflow, I would, for example, ask "which chemistry have you used for your experiment" giving a list of available choices directly in the prompt. This would then write the correct chemistry directly into the config.yaml instead of relying on people to write it themselves. This would ensure that there are no spelling mistakes and check for int types for example.
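A minimal sketch of such a "prepare" step, under stated assumptions: the function names, the chemistry choices and the flat key: value output are all hypothetical, and the input function is injectable so the prompts can be tested without a terminal.

```python
def ask_choice(question, choices, input_fn=input):
    """Ask until the answer is a listed number or value; return the value."""
    while True:
        print(question)
        for number, choice in enumerate(choices, start=1):
            print("  {}) {}".format(number, choice))
        answer = input_fn("> ").strip()
        if answer.isdigit() and 1 <= int(answer) <= len(choices):
            return choices[int(answer) - 1]
        if answer in choices:
            return answer
        print("Please pick one of the listed options.")


def ask_int(question, input_fn=input):
    """Ask until the answer parses as an integer."""
    while True:
        answer = input_fn(question + " ").strip()
        try:
            return int(answer)
        except ValueError:
            print("Please enter a whole number.")


def render_config(values):
    """Render a flat dict as simple 'key: value' lines (valid YAML for plain scalars)."""
    return "".join("{}: {}\n".format(key, value) for key, value in values.items())


# Hypothetical usage:
#   chemistry = ask_choice("Which chemistry have you used?", ["v2", "v3"])
#   cores = ask_int("How many cores?")
#   open("config.yaml", "w").write(render_config({"chemistry": chemistry, "cores": cores}))
```

Restricting answers to a fixed list is what rules out spelling mistakes; the integer prompt covers the type check mentioned above.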
This preparing was always "odd" in a workflow because normally you rely on having everything setup before running it. Although the new checkpoints might fit the bill?
One potential issue with the executable approach is that it might become confusing to the users if you have both the executable and the classic workflow available and the fact that the workflow would be in the lib dir of python and not cloned somewhere.
Since this is a big/strange shift I wanted to have your opinion before moving forward with this experiment. Let me know if I'm missing any crucial problems.
Best wishes
I'm developing metagenome atlas, a Snakemake pipeline for metagenomics which takes you through all the steps: QC, assembly, binning, genome prediction, and annotation.
I made a Click wrapper to get users started in three commands.
Now I'm planning how to maintain this pipeline, and I'm thinking I should modularize it to be able to update and test parts.
I read the Snakemake docs about modularization.
Now my question is: what is the best way to do this so that the sub-workflows work as stand-alone workflows but also as part of metagenome atlas?
I have created a repository with a Snakemake workflow for trimming data generated with the Accel Amplicon Panel, following the recommended guidelines provided by Swift Biosciences:
https://github.com/clinical-genomics-uppsala/accel_amplicon_trimming
I would like to discuss what is the best way to specify files in a way that they can be used across workflows.
Take the example of two workflows e.g
Workflow 1: reads --> assembly
Workflow 2: assembly + reads --> assembly statistics ...
What is the best way to specify the reads and assembly so that they can be used by different workflows?
Take into account that
Requirement A: The reads might be used at multiple places in Workflow 2.
Requirement B: The reads will probably be used to infer the total number of samples in the target rule.
With sub-workflows, it would be possible to define otherworkflow(file)
But I think the recommended way now is to use modules and to import the rules of Workflow 1 and 2 into a new workflow.
But then I would need to know which rules to modify to adapt the file specification. This would necessarily have to be documented in the README of a workflow.
I don't see how this can be done without massively modifying many rules of an imported workflow.
Any thoughts?
Hello,
The latest version of my pipeline is an attempt to turn it into a Snakemake workflow.
I'm kindly asking for a review.
I have not yet worked on specific envs for each rule but this can be done in the future without too much effort.
Please tell me if there is anything else that I need to implement to pass the review.
Best wishes
We are hoping to publish it as an applications note. Work in progress here:
https://github.com/biocore-ntnu/chip_seq_pipeline
Just a heads-up.
It is massive so reviewing it will probably be hard :/ Any feedback appreciated though :) Going to focus hard on finishing the docs and getting a beta out.
Edit: I see that I do not have an integration test, just plenty of dryrun tests testing the DAG logic. I started this before learning about this repo, so I might not be following best practices :/
I'm trying to get Snakemake working with Singularity, and I need to debug the Singularity command, but the script wrapper that does the execution doesn't exist after the failure, e.g., I need to test:
singularity exec --home /home/vanessa/Documents/Dropbox/Code/labs/cherry/snakemake/encode-demo-workflow --bind /home/vanessa/anaconda3/lib/python3.7/site-packages:/mnt/snakemake /home/vanessa/Documents/Dropbox/Code/labs/cherry/snakemake/encode-demo-workflow/.snakemake/singularity/328e6123b3d8f239ce917fa97ccbbd80.simg bash -c 'set -euo pipefail; python /home/vanessa/Documents/Dropbox/Code/labs/cherry/snakemake/encode-demo-workflow/.snakemake/scripts/tmp2navashs.wrapper.py'
And why is home being force-bound to the present working directory? That seems strange for Singularity. Should it be --pwd?
Hi,
After running the command
conda env create --name snakemake-tutorial --file environment.yaml
I end up with the error:
SpecNotFound: Invalid name, try the format: user/package
Do you have any suggestions?
Thank you!
This is a place for general, project-related discussions.
A couple of questions: could you explain the first column of the unit patterns TSV file? The doc example looks sort of like a regex, but the <nr> part is not.
Also, what if you don’t have unit components in your FASTQ file names? It’s quite common to only have one-to-one map between FASTQ and sample without any lane info. Is the workflow structure designed to also work with only sample names and without resorting to inventing units/lanes in the unit patterns file and manually changing FASTQ filenames to include these IDs?
snakemake workflows is a great idea! Thanks for putting it together.
What do you think about having a command line tool for running any Snakemake workflow with the following user API:
sworkflow config.yaml [other snakemake flags]
Here config.yaml would be the standard config.yaml with an additional workflow entry specifying which workflow to use (as a string):
workflow: https://github.com/snakemake-workflows/single-cell-rna-seq/tree/a1be3b6b389b009d91bb1d7f75abc1b5a23cd19d
# The usual config.yaml ----------------------------------
# path to sheet describing each cell.
cells: cells.tsv
# specify count table (rows: genes/transcripts/spikes, cols: cells)
counts: counts.tsv
...
This functionality is conceptually similar to Snakemake rule wrappers, where you refer to a command with a single string.
The sworkflow command would do the following: fetch the workflow (into ~/.snakemake/workflows/<myworkflow>?), then run
snakemake --snakefile ~/.snakemake/workflows/<myworkflow>/Snakefile [other snakemake flags]
This command would come in handy when you want to apply a single workflow multiple times (say you are analyzing different but related datasets). Currently, you'd need to check out the source code into each directory.
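The command-building half of this proposal could be sketched as below. Everything here is an assumption for illustration: sworkflow does not exist, the function names and the hashed cache layout are hypothetical, fetching the workflow is left out, and the workflow entry is read with a naive line scan instead of a YAML parser. Only the snakemake flags --snakefile and --configfile are existing options.

```python
import hashlib
import os


def read_workflow_entry(config_text):
    """Return the value of the top-level 'workflow:' key, or None."""
    for line in config_text.splitlines():
        if line.startswith("workflow:"):
            return line.split(":", 1)[1].strip()
    return None


def build_snakemake_command(config_path, config_text, extra_flags, cache_root):
    """Build the snakemake call for the workflow named in the config."""
    url = read_workflow_entry(config_text)
    if url is None:
        raise ValueError("config has no 'workflow' entry")
    # Hypothetical cache layout: one directory per workflow URL.
    name = hashlib.sha1(url.encode("utf-8")).hexdigest()[:12]
    snakefile = os.path.join(cache_root, name, "Snakefile")
    command = ["snakemake", "--snakefile", snakefile, "--configfile", config_path]
    return command + list(extra_flags)


example_config = (
    "workflow: https://github.com/snakemake-workflows/single-cell-rna-seq\n"
    "cells: cells.tsv\n"
    "counts: counts.tsv\n"
)
print(build_snakemake_command("config.yaml", example_config, ["--cores", "4"],
                              os.path.expanduser("~/.snakemake/workflows")))
```

Keying the cache on the URL (which can include a commit hash, as in the example config above) would let several analysis directories share one checkout of the same workflow version.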
Please post below to request a review.
Hi There,
I've followed all of the install instructions, and now I seem to be running into permission issues. My YAML files all seem to be working correctly now, and a pool is created successfully.
But some of the rules at the beginning of the Snakefile seem to be causing permission issues when creating directories.
Provided is the error I'm getting:
I've attached my snakefile and yaml files for the batch-shipyard configuration. Some help would be really appreciated on this! I've followed the installation guide very carefully so I think it's something specific with Snakemake. What's odd is that it works no problem when installing snakemake and the dependencies in Conda.
# Snakefile for the RNA-Seq analysis pipeline using test data from zebrafish
# You should not need to edit this file unless you are changing the programs in the pipeline
configfile: "config_zebrafish.yaml"

SAMPLES = config['samples']
R1_suffix = config['input_file_R1_suffix']
R2_suffix = config['input_file_R2_suffix']
genome_fasta_file = config['genome_fasta_file']
genome_index_base = config['genome_index_base']
merged_transcripts_file = config['merged_transcripts_file']

rule trim_and_qc_all:
    input:
        html=expand("{sample}_R1.trimmed_paired_fastqc.html", sample=SAMPLES)

rule trim_reads:
    input:
        R1_reads="data/{sample}" + R1_suffix,
        R2_reads="data/{sample}" + R2_suffix
    output:
        "1_trimmed_reads/{sample}_R1.trimmed_paired.fastq",
        "1_trimmed_reads/{sample}_R1.trimmed_unpaired.fastq",
        "1_trimmed_reads/{sample}_R2.trimmed_paired.fastq",
        "1_trimmed_reads/{sample}_R2.trimmed_unpaired.fastq"
    threads: config['threads']
    params:
        run_params=config['trimmomatic_params']
    # The command is wrapped in single quotes so that the inner double quotes do not end the Python string.
    shell:
        'echo -e "#!/usr/bin/env bash\ncd $FILESHARE;\n trimmomatic PE -threads {threads} {input.R1_reads} {input.R2_reads} {output}" > $FILESHARE/jobrun.sh ;\n $SHIPYARD/shipyard jobs add --configdir $FILESHARE/azurebatch --tail stderr.txt\n'