Comments (7)
Hi @moldach ,
seems like parallelizing over chromosomes is really an enhancement here. Did you think about using the chromosomes as a wildcard? Then, snakemake would run every chromosome in a different job, without the need of xargs and the likes.
Additionally, you could also pipe the output from samtools mpileup
to varscan pileup2snp
between rules, so you can actually use a separate rule for each of those two steps.
from snakemake-wrappers.
Hi @jafors thanks for getting back to me.
I've tried your suggestion of pipe the output from samtools mpileup
to varscan pileup2snp
between rules and adding another rule to combine the output of all these
rule mpilup_I:
input:
bam=lambda wildcards: getDeduppedBams(wildcards.sample),
reference_genome=os.path.join(dirs_dict["REF_DIR"],config["REF_GENOME"])
output:
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_I.mpileup.gz")
log:
os.path.join(dirs_dict["LOG_DIR"],config["CALLING_TOOL"],"{sample}_I_samtools_mpileup.log")
params:
extra="-r I", # optional
resources:
mem = 1000,
time = 30
wrapper:
"0.65.0/bio/samtools/mpileup"
rule mpileup_to_vcf_I:
input:
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_I.mpileup.gz")
output:
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_I.vcf")
message:
"Calling SNP with Varscan2"
threads:
2 # Keep threading value to one for unzipped mpileup input
# Set it to two for zipped mipileup files
log:
os.path.join(dirs_dict["LOG_DIR"],config["CALLING_TOOL"],"varscan_{sample}_I.log")
resources:
mem = 1000,
time = 30
wrapper:
"0.65.0/bio/varscan/mpileup2snp"
...
and then combine the output of these rules at the end
rule vcf_merge:
input:
I = os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_I.vcf"),
II = os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_II.vcf"),
III = os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_III.vcf"),
IV = os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_IV.vcf"),
V = os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_V.vcf"),
X = os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_X.vcf"),
MtDNA = os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_MtDNA.vcf")
output:
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_sorted_dedupped_snp_varscan.vcf")
log: os.path.join(dirs_dict["LOG_DIR"],config["CALLING_TOOL"],"{sample}_vcf-merge.log")
resources:
mem = 1000,
time = 10
threads: 1
message: """--- Merge VarScan by Chromosome."""
shell: """
awk 'FNR==1 && NR!=1 {{ while (/^<header>/) getline; }} 1 {{print}} ' {input.I} {input.II} {input.III} {input.IV} {input.V} {input.X} {input.MtDNA} > {output}
"""
We don't have to list the temporary files in rule all
- just the final merged .vcf
:
rule all:
input:
expand('{CALLING_DIR}/{CALLING_TOOL}/{sample}_sorted_dedupped_snp_varscan.{ext}', CALLING_DIR=dirs_dict["CALLING_DIR"], CALLING_TOOL=config["CALLING_TOOL"], sample=sample_names, ext=['vcf']),
For C. elegans it produces a DAG, like:
(snakemake) [moldach@arc wrappers]$ snakemake --use-conda --profile slurm --jobs 13
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cluster nodes: 13
Job counts:
count jobs
1 all
1 mpileup_to_vcf_I
1 mpileup_to_vcf_II
1 mpileup_to_vcf_III
1 mpileup_to_vcf_IV
1 mpileup_to_vcf_MtDNA
1 mpileup_to_vcf_V
1 mpileup_to_vcf_X
1 mpilup_I
1 mpilup_II
1 mpilup_III
1 mpilup_IV
1 mpilup_MtDNA
1 mpilup_V
1 mpilup_X
1 vcf_merge
16
However, forever the Human genome this would require many more rules.
Is there a way to do this more succinctly?
from snakemake-wrappers.
There is indeed a way to improve on this when you replace the specific contigs with a wildcard.
You can expand in the merging rule:
input:
variants=expand(os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"], "{{sample}}_{contig}.vcf"), contig=["I", "II"]) # and so on
output:
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_sorted_dedupped_snp_varscan.vcf")
Then, you need the other mpileup
and mpileup_to_vcf
only once, replacing the "I"
with {contig}
. In the params
of rule mpileup
, you can do:
params:
extra=lambda wc: "-r {}".format(wc.contig)
from snakemake-wrappers.
Thanks for the suggestion.
I've tried the following but get an error:
rule mpilup:
input:
bam=lambda wildcards: getDeduppedBams(wildcards.sample),
reference_genome=os.path.join(dirs_dict["REF_DIR"],config["REF_GENOME"])
output:
expand(os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_{contig}.mpileup.gz"), sample=sample_names, contig=["I","II","III","IV","V","X","MtDNA"])
log:
expand(os.path.join(dirs_dict["LOG_DIR"],config["CALLING_TOOL"],"{sample}_{contig}_samtools_mpileup.log"), sample=sample_names, contig=["I","II","III","IV","V","X","MtDNA"])
params:
extra=lambda wc: "-r {}".format(wc.contig)
resources:
mem = 1000,
time = 30
wrapper:
"0.65.0/bio/samtools/mpileup"
rule mpileup_to_vcf:
input:
expand(os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_{contig}.mpileup.gz"), sample=sample_names, contig=["I","II","III","IV","V","X","MtDNA"])
output:
expand(os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_{contig}.vcf"), sample=sample_names, contig=["I","II","III","IV","V","X","MtDNA"])
message:
"Calling SNP with Varscan2"
threads:
2 # Keep threading value to one for unzipped mpileup input
# Set it to two for zipped mipileup files
log:
expand(os.path.join(dirs_dict["LOG_DIR"],config["CALLING_TOOL"],"varscan_{sample}_{contig}.log"), sample=sample_names, contig=["I","II","III","IV","V","X","MtDNA"])
resources:
mem = 1000,
time = 30
wrapper:
"0.65.0/bio/varscan/mpileup2snp"
Error
(snakemake) [moldach@arc wrappers]$ snakemake -n -r
Workflow defines that rule get_vep_cache is eligible for caching between workflows (use the --cache argument to enable this).
Building DAG of jobs...
InputFunctionException in line 341 of /home/moldach/wrappers/Snakefile:
Error:
AttributeError: 'Wildcards' object has no attribute 'sample'
Wildcards:
Traceback:
File "/home/moldach/wrappers/Snakefile", line 343, in <lambda>
Any idea what could be wrong?
from snakemake-wrappers.
vim +341 Snakefile
takes me to:
rule mpilup:
input:
bam=lambda wildcards: getDeduppedBams(wildcards.sample),
reference_genome=os.path.join(dirs_dict["REF_DIR"],config["REF_GENOME"])
output:
expand(os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_{contig}.mpileup.gz"), sample=sample_names, contig=["I","II","III","IV","V","X","MtDNA"])
log:
expand(os.path.join(dirs_dict["LOG_DIR"],config["CALLING_TOOL"],"{sample}_{contig}_samtools_mpileup.log"), sample=sample_names, contig=["I","II","III","IV","V","X","MtDNA"])
params:
extra=lambda wc: "-r {}".format(wc.contig)
resources:
mem = 1000,
time = 30
wrapper:
"0.65.0/bio/samtools/mpileup"
from snakemake-wrappers.
There is no need to expand in the two mpileup-related rules. Those rules will run once for all contigs over all samples. Just write
rule mpileup_to_vcf:
input:
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_{contig}.mpileup.gz")
output:
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_{contig}.vcf")
and the contigs will be determined by the expand()
in rule vcf_merge
similar to the samples which are determined by the expand()
in rule all
.
In rule mpileup
, you do the same and remove the expand from the output and the log fields.
from snakemake-wrappers.
Hi @jafors thank you very much for you help. This works wonderfully.
However, I've noticed that the output of these two rules creates downstream issues with another rule #167
from snakemake-wrappers.
Related Issues (20)
- Sickle se/pe HOT 1
- The ensembl-annotation uses the wrong download URL HOT 3
- Error while testing bbduk HOT 1
- Salmon Quant wrapper throwing segmentation fault HOT 1
- MACS2: support --extsize with fragment size input file
- bwa-memx_index meme wrapper expects specific file name? HOT 3
- Possible bowtie2 align bug HOT 4
- Generic subfolder? HOT 1
- STAR wrapper allowing different output names HOT 3
- CI: autobump workflow never finishes HOT 2
- GFFREAD wrapper produces gff output when fasta file requested HOT 1
- building the docs on a fork doesn't work HOT 2
- FastQC resource allocation not working as expected - get_mem fix? HOT 3
- MultiQC wrapper env issue with imp HOT 1
- Wrappers for Google Batch HOT 3
- ModuleNotFoundError: No module named 'snakemake_wrapper_utils HOT 10
- `bcftools reheader` no longer works with map-style `sample.tsv`
- `bcftools merge`: not able to declare dependency on index files HOT 1
- `deepvariant`: conda environment not installable HOT 3
- change to earlier docs versions on readthedocs HOT 4
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from snakemake-wrappers.