snakemake / snakemake-wrappers

This is the development home of the Snakemake wrapper repository; see the home page below.

Home Page: https://snakemake-wrappers.readthedocs.io


snakemake-wrappers's Introduction


The Snakemake Wrapper Repository

The Snakemake Wrapper Repository is a collection of reusable wrappers that make it easy to use popular command line tools from Snakemake rules and workflows.

Visit https://snakemake-wrappers.readthedocs.io for more information.

snakemake-wrappers's People

Contributors

antoniev, beatrice-tan, bluegenes, bow, bpow, brcopeland, cbphk, christopher-schroeder, cpauvert, currocam, daler, dlaehnemann, felixmoelder, fgvieira, flatberg, fxwiegand, github-actions[bot], jafors, jakevc, johanneskoester, mbhall88, nikostr, phlya, smeds, snakedeploy-bot[bot], tdayris, tdayris-perso, tdido, tpoorten, williamrowell


snakemake-wrappers's Issues

Display shell command when `-p` is given.

snakemake --version
5.10.0

I'm not sure if this belongs here or in the general Snakemake repository, but it would be really nice to see the shell command for Snakemake wrapper rules when using the -p parameter to submit a job.

Modularity of wrappers?

In the same spirit as https://github.com/snakemake-workflows, is there any concern here about maintaining versions/containers for wrappers? I just cloned the repository locally and it wasn't tiny. My larger concern is that if we also serve containers from here (with automated builds for the wrappers), that will be more complicated than having a https://github.com/snakemake-wrappers organization and storing wrappers under the complete URI; e.g., the current bio/arriba would live at https://github.com/snakemake-wrappers/bio/arriba. If some web interface (or similar) is then desired to show the wrappers, it could be rendered at snakemake-wrappers.github.io and use a simple query to the API to update the listing.

SonarCloud Code Inspections profile is very confusing

SonarCloud inspections aren't configured properly for this project, so the current master and all recent wrapper releases (e.g. v0.41.0) are marked as "Failed" by SonarCloud.

In detail, SonarCloud doesn't understand that the snakemake object is already available in all wrapper.py files, so its use shouldn't be flagged as a blocker error due to an undefined variable.

Such a SonarCloud profile marks pull requests into this repo as failed not because of bugs, but because of false positives.

Joint wrapper for samse and sampe

Is your feature request related to a problem? Please describe.
Right now, bwa samse and sampe are two different wrappers. This feels a bit unnecessary, since there is only one output, and the input can simply be an array of FASTQ files and another of SAI files.

Describe the solution you'd like
A single wrapper for both, where the algorithm is chosen by a param:

    params:
        index="genome",
        extra=r"-r '@RG\tID:{sample}\tSM:{sample}'", # optional: Extra parameters for bwa.
        sort="none",                                 # optional: Enable sorting. Possible values: 'none', 'samtools' or 'picard'`
        sort_order="queryname",                      # optional: Sort by 'queryname' or 'coordinate'
        sort_extra="",                               # optional: extra arguments for samtools/picard
        alg="sampe"

or

    params:
        alg = lambda wildcards, input: "samse" if len(input.sai) == 1 else "sampe",

or even have it auto-detected inside the wrapper.

Does this make sense?
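For the auto-detected variant, the wrapper body could branch on the number of .sai inputs. A minimal sketch, assuming the rule passes them as input.sai and input.fastq (hypothetical names; the snakemake object is injected into wrapper scripts at runtime):

from snakemake.shell import shell

sai = snakemake.input.sai
fastq = snakemake.input.fastq
alg = "samse" if len(sai) == 1 else "sampe"
# Both commands take the index prefix, then the sai file(s), then the fastq(s).
shell("bwa {alg} {snakemake.params.index} {sai} {fastq} > {snakemake.output[0]}")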

FastQC wrapper doesn't support *.fq.gz reads after #14 fix

Snakemake version
Snakemake wrappers version 0.64.0,
Snakemake version any (e.g. 5.26.1)

Describe the bug

FASTA or FASTQ files are often named like *.fq.gz. Starting with version 0.50.0, the fastqc wrapper doesn't support this naming convention and always fails for such files. This happens because the FastQC tool and the wrapper now expect different names for the result file, so the wrapper cannot find the report file.

E.g., for foo_1.fq.gz, FastQC generates a foo_1_fastqc.html report. Wrappers < 0.50.0 used the same naming convention, but after the #14 fix the wrapper expects foo_1.fq_fastqc.html, which doesn't exist, so the mv command fails with a file-not-found error.

Logs

fastqc  --quiet -t 1 --outdir /scratch1/fs1/martyomov/rcherniatchik/pilot_small/tmp/tmp7kg3b6vx bams/A01/V300010918_L4_HUMvotMAAAAAAA-1_hg38_unmapped_reads_1.fq.gz  2> qc/bams_unmapped/A01/V300010918_L4_HUMvotMAAAAAAA-1_hg38_unmapped_reads_1_fastqc.log
mv /scratch1/fs1/martyomov/rcherniatchik/pilot_small/tmp/tmp7kg3b6vx/V300010918_L4_HUMvotMAAAAAAA-1_hg38_unmapped_reads_1.fq_fastqc.html qc/bams_unmapped/A01/V300010918_L4_HUMvotMAAAAAAA-1_hg38_unmapped_reads_1_fastqc.html
mv: cannot stat '/scratch1/fs1/martyomov/rcherniatchik/pilot_small/tmp/tmp7kg3b6vx/V300010918_L4_HUMvotMAAAAAAA-1_hg38_unmapped_reads_1.fq_fastqc.html': No such file or directory
Traceback (most recent call last):
  File "/scratch1/fs1/martyomov/rcherniatchik/pilot_small/.snakemake/scripts/tmpgp95haer.wrapper.py", line 47, in <module>
    shell("mv {html_path} {snakemake.output.html}")
  File "/opt/conda/lib/python3.7/site-packages/snakemake/shell.py", line 176, in __new__
    raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'set -euo pipefail;  mv /scratch1/fs1/martyomov/rcherniatchik/pilot_small/tmp/tmp7kg3b6vx/V300010918_L4_HUMvotMAAAAAAA-1_hg38_unmapped_reads_1.fq_fastqc.html qc/bams_unmapped/A01/V300010918_L4_HUMvotMAAAAAAA-1_hg38_unmapped_reads_1_fastqc.html' returned non-zero exit status 1.

Work dir folder content:

$ ls /scratch1/fs1/martyomov/rcherniatchik/pilot_small/tmp/tmp7kg3b6vx/

V300010918_L4_HUMvotMAAAAAAA-1_hg38_unmapped_reads_1_fastqc.html 
V300010918_L4_HUMvotMAAAAAAA-1_hg38_unmapped_reads_1_fastqc.zip

Minimal example

rule fastqc:
    input:
        "reads/{sample}.fq.gz"
    output:
        html="qc/fastqc/{sample}.html",
        zip="qc/fastqc/{sample}_fastqc.zip"
    log:
        "logs/fastqc/{sample}.log"
    threads: 1
    wrapper:
        "master/bio/fastqc"

Additional context
Before #14, the basename_without_ext method was:

    split_ind = 2 if base.endswith(".gz") else 1
    base = ".".join(base.split(".")[:-split_ind])

after:

    split_ind = 2 if base.endswith(".fastq.gz") else 1
    base = ".".join(base.split(".")[:-split_ind])

VEP cache: specify output folder

Snakemake version
snakemake: 5.25.0
snakemake-wrappers: 0.65.0

Describe the bug
The vep/cache wrapper has, as its single output, the path where the cache is downloaded:
https://github.com/snakemake/snakemake-wrappers/blob/master/bio/vep/cache/test/Snakefile#L3

Inside this folder, the wrapper creates subfolders for species, build and release. However, these folders are not part of the rule's output path, so the rule cannot properly check whether the output exists. That is, if the pipeline is run twice with different genomes, the second run will use the cache from the first run.

Suggested fix
I'd suggest having the species/build/release as wildcards and as part of the output folder. The wrapper would have to make sure the output path is valid (it would have to end in {species}/{release}_{build}), download the cache to a temp folder, and then move it to the right output folder; for example:

rule get_vep_cache:
    output:
        directory("resources/vep/cache/{species}/{release}_{build}")
    params:
        species = lambda wildcards: wildcards.species,
        build = lambda wildcards: wildcards.build,
        release = lambda wildcards: wildcards.release,
    log:
        "logs/vep/cache/{species}/{release}_{build}.log"
    wrapper:
        "master/bio/vep/cache"

The params could also be deprecated, if the wrapper can get species/build/release from the wildcards.

The vep annotate rule would take the full path to the cache and infer species/build/release.
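A sketch of that inference, assuming the cache path ends in {species}/{release}_{build} as proposed above:

import os

def parse_cache_path(cache_dir):
    """Split .../{species}/{release}_{build} into its components."""
    species = os.path.basename(os.path.dirname(cache_dir))
    release, build = os.path.basename(cache_dir).split("_", 1)
    return species, release, build

# parse_cache_path("resources/vep/cache/saccharomyces_cerevisiae/98_R64-1-1")
# -> ("saccharomyces_cerevisiae", "98", "R64-1-1")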

Versions, citations and references for tools being used

Is your feature request related to a problem? Please describe.
I am currently writing a manuscript that uses Snakemake with a lot of wrapper scripts, and I have to go through all tools individually to get (a) the exact version of the tool, and (b) its reference for the bibliography.

My process is:

  • Go through all my snakemake files and all rules and tools being used. For each of them:
  • Find the wrapper version:
    wrapper:
            "0.51.3/bio/trimmomatic/pe"
    
  • Find the documentation at that exact version of the wrapper: https://snakemake-wrappers.readthedocs.io/en/0.51.3/wrappers/trimmomatic/pe.html
  • Get the version of the tool being used from there.
  • Do a Google Scholar search for the tool to get its preferred citation/reference.
  • Then, put this into text: "We used trimmomatic v0.36 [1]", etc...

This is repetitive work, not only for me, but for everyone who is in a similar situation. The two pieces of information that I am interested in here are the version of the tool being used, and its reference.

Describe the solution you'd like
Ideally, snakemake could offer a command that takes a config file/workflow, figures out all tools being used in that exact workflow configuration (skipping any rules that are not invoked!), automatically collects the needed information, and prints it in some format.

Tool version information is already available via the environment.yaml of each wrapper, and citations or links to the tool websites could be added (optionally, and bit by bit for each tool) to the meta.yaml and read from there.

I guess this is only feasible for wrapped tools, and I would still have to go through my individual (shell/script-based) rules. But the snakemake command above could at least list those rules as well, so that I do not forget about them. Basically, it would print a linear list of the dependency graph (similar to the normal terminal output when running snakemake, but only once per rule), with all tools and references listed.

So, my ideal solution would look something like this:

$ snakemake [...] --print-tools # or whatever you would want to call this command
Rule `trim_read`:
    - Uses wrapper `TRIMMOMATIC PE`
    - Tool: trimmomatic v0.36
    - Conda: https://anaconda.org/bioconda/trimmomatic
    - URL: http://www.usadellab.org/cms/index.php?page=trimmomatic
    - Reference: Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics, 30(15), 2114–2120. https://doi.org/10.1093/bioinformatics/btu170
Rule `my_rule`:
    - Uses shell-based command
Rule `my_other_rule`:
    - Uses python-based command
....

That would be awesome!

Describe alternatives you've considered
I thought that a simple overview table here in the repo might also be a solution, but it would get messy rather quickly with all the wrappers in different versions, it would not really speed things up, it would mean extra maintenance work, and it would probably be outdated all the time. No, not a good alternative.

conda-forge is not priority channel for many wrappers

Is your feature request related to a problem? Please describe.

If I remember correctly, the order of channels has a big impact on the performance of the install process: conda-forge should be prioritized over bioconda. Many wrappers have it the other way around.

Describe the solution you'd like
Have consistent channel priorities in all environment.yaml files. If I remember correctly, this is the optimal ordering:

channels:
  - conda-forge
  - bioconda
  - defaults

Describe alternatives you've considered
I know mamba is another solution.

Additional context

I can help, if this change is accepted.

Error occurred during initialization of VM java/lang/NoClassDefFoundError: java/lang/Object

Snakemake version

5.8.1

NOTE: I will explain in the additional context section why I'm not using the latest version.

Wrapper version

0.64.0/bio/fastqc

Describe the bug

I'm getting the following error trying to use the fastqc wrapper:

Error occurred during initialization of VM 

Logs

Building DAG of jobs...
Using shell: /cvmfs/soft.computecanada.ca/nix/var/nix/profiles/16.09/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       qc_before_trim_r2
        1

[Wed Aug 26 09:34:45 2020]
Job 0: --- Quality check of raw data with FastQC before trimming.

python /scratch/moldach/MADDOG/VCF-FILES/biostars439754/MTG353/.snakemake/scripts/tmpb0_czsv2.wrapper.py
Activating conda environment: /scratch/moldach/MADDOG/VCF-FILES/biostars439754/MTG353/.snakemake/conda/91998b6c
fastqc  --quiet -t 1 --outdir /tmp/tmpjnd9fio7 /home/moldach/projects/def-mtarailo/common/WGS_6/MTG324/MTG324_R2.fastq.gz ' 2> logs/fastqc/MTG324_R2.log'
Error occurred during initialization of VM
java/lang/NoClassDefFoundError: java/lang/Object
Traceback (most recent call last):
  File "/scratch/moldach/MADDOG/VCF-FILES/biostars439754/MTG353/.snakemake/scripts/tmpb0_czsv2.wrapper.py", line 35, in <module>
    shell(
  File "/home/moldach/bin/snakemake/lib/python3.8/site-packages/snakemake/shell.py", line 156, in __new__
    raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'set -euo pipefail;  fastqc  --quiet -t 1 --outdir /tmp/tmpjnd9fio7 /home/moldach/projects/def-mtarailo/common/WGS_6/MTG324/MTG324_R2.fastq.gz ' 2> logs$
[Wed Aug 26 09:35:09 2020]
Error in rule qc_before_trim_r2:
    jobid: 0
    output: qc/fastQC/before_trim/MTG324_R2_fastqc.html, qc/fastQC/before_trim/MTG324_R2_fastqc.zip
    log: logs/fastqc/MTG324_R2.log (check log file(s) for error message)
    conda-env: /scratch/moldach/MADDOG/VCF-FILES/biostars439754/MTG353/.snakemake/conda/91998b6c

RuleException:
CalledProcessError in line 140 of /scratch/moldach/MADDOG/VCF-FILES/biostars439754/MTG353/Snakefile:
Command 'source /home/moldach/miniconda3/bin/activate '/scratch/moldach/MADDOG/VCF-FILES/biostars439754/MTG353/.snakemake/conda/91998b6c'; set -euo pipefail;  python /scratch/moldach/MADDOG/V$
  File "/scratch/moldach/MADDOG/VCF-FILES/biostars439754/MTG353/Snakefile", line 140, in __rule_qc_before_trim_r2
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.8.0/lib/python3.8/concurrent/futures/thread.py", line 57, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

Minimal example

The config.yaml

# Files
REF_GENOME: "c_elegans.PRJNA13758.WS265.genomic.fa"
GENOME_ANNOTATION: "c_elegans.PRJNA13758.WS265.annotations.gff3"

# Tools
QC_TOOL: "fastQC"
TRIM_TOOL: "trimmomatic"
ALIGN_TOOL: "bwa"
MARKDUP_TOOL: "picard"
CALLING_TOOL: "varscan"
ANNOT_TOOL: "vep"

The Snakefile

# Directories------------------------------------------------------------------
configfile: "config.yaml"

# Setting the names of all directories
dir_list = ["REF_DIR", "LOG_DIR", "BENCHMARK_DIR", "QC_DIR", "TRIM_DIR", "ALIGN_DIR", "MARKDUP_DIR", "CALLING_DIR", "ANNOT_DIR"]
dir_names = ["refs", "logs", "benchmarks", "qc", "trimming", "alignment", "mark_duplicates", "variant_calling", "annotation"]
dirs_dict = dict(zip(dir_list, dir_names))

import os
import pandas as pd
# getting the samples information (names, path to r1 & r2) from samples.txt
samples_information = pd.read_csv("samples.txt", sep='\t', index_col=False)
# get a list of the sample names
sample_names = list(samples_information['sample'])
sample_locations = list(samples_information['location'])
samples_dict = dict(zip(sample_names, sample_locations))
# get number of samples
len_samples = len(sample_names)


# Rules -----------------------------------------------------------------------

rule all:
    input:
        expand('{QC_DIR}/{QC_TOOL}/before_trim/{sample}_{pair}_fastqc.{ext}', QC_DIR=dirs_dict["QC_DIR"], QC_TOOL=config["QC_TOOL"], sample=sample_names, pair=['R1', 'R2'], ext=['html', 'zip'])

def getHome(sample):
    return [os.path.join(samples_dict[sample], "{0}_{1}.fastq.gz".format(sample, pair)) for pair in ['R1', 'R2']]

rule qc_before_trim_r1:
    input:
        r1 = lambda wildcards: getHome(wildcards.sample)[0]
    output:
        html=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim","{sample}_R1_fastqc.html"),
        zip=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim","{sample}_R1_fastqc.zip"),
    params: ""
    log:
        "logs/fastqc/{sample}_R1.log"
    resources:
        mem = 1000,
        time = 30
    threads: 1
    message: """--- Quality check of raw data with FastQC before trimming."""
    wrapper:
        "0.64.0/bio/fastqc"

rule qc_before_trim_r2:
    input:
        r1 = lambda wildcards: getHome(wildcards.sample)[1]
    output:
        html=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim","{sample}_R2_fastqc.html"),
        zip=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim","{sample}_R2_fastqc.zip"),
    params: ""
    log:
        "logs/fastqc/{sample}_R2.log"
    resources:
        mem = 1000,
        time = 30
    threads: 1
    message: """--- Quality check of raw data with FastQC before trimming."""
    wrapper:
        "0.64.0/bio/fastqc"

Additional context

It was mentioned in the preamble to this issue that I should try the newest version of Snakemake. I downloaded the newest version via:

$ mamba create -c conda-forge -c bioconda -n snakemake snakemake
$ conda activate snakemake

But now when I try a dry-run, I get a Segmentation fault.

Using the older version of Snakemake (for comparison)

$ source ~/bin/snakemake/bin/activate
(snakemake) $ snakemake -n -r
Building DAG of jobs...
Job counts:
        count   jobs
        1       all
        1       qc_before_trim_r1
        1       qc_before_trim_r2
        3

[Wed Aug 26 10:28:41 2020]
Job 2: --- Quality check of raw data with FastQC before trimming.
Reason: Missing output files: qc/fastQC/before_trim/MTG324_R2_fastqc.html, qc/fastQC/before_trim/MTG324_R2_fastqc.zip

[Wed Aug 26 10:28:41 2020]
Job 1: --- Quality check of raw data with FastQC before trimming.
Reason: Missing output files: qc/fastQC/before_trim/MTG324_R1_fastqc.zip, qc/fastQC/before_trim/MTG324_R1_fastqc.html

[Wed Aug 26 10:28:41 2020]
localrule all:
    input: qc/fastQC/before_trim/MTG324_R1_fastqc.html, qc/fastQC/before_trim/MTG324_R1_fastqc.zip, qc/fastQC/before_trim/MTG324_R2_fastqc.html, qc/fastQC/before_trim/MTG324_R2_fastqc.zip
    jobid: 0
    reason: Input files updated by another job: qc/fastQC/before_trim/MTG324_R2_fastqc.html, qc/fastQC/before_trim/MTG324_R2_fastqc.zip, qc/fastQC/before_trim/MTG324_R1_fastqc.zip, qc/fastQC/before_trim/MTG324_R1_fastqc.html

Job counts:
        count   jobs
        1       all
        1       qc_before_trim_r1
        1       qc_before_trim_r2
        3
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

$ deactivate

Newest version of Snakemake

$ conda activate snakemake
$ snakemake --version
5.23.0
$ snakemake -n -r
Segmentation fault

I guess other pertinent information is that I'm on an academic HPC with a SLURM scheduler.

$ cat /etc/os-release

NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

The issue I see with using the newest Snakemake version is the following:

If I were to run an interactive job to get more memory (via salloc --time=1:0:0 --mem=1000) and then submit the workflow (a full pipeline consisting of many wrappers) via bash -c "nohup snakemake --profile slurm --use-conda --jobs 500 &", it would only run jobs for as long as the interactive job was allocated.

As I understand it, Snakemake needs to be run from the head node, since it submits jobs to the SLURM scheduler.

Is it possible that Snakemake 5.23.0 is more memory-intensive than 5.8.1? And if so, does this preclude me from using it?

No "code" displayed in wrappers' documentation

Snakemake-wrappers 0.66.0, snakemake 5.24.0.
The displayed code disappeared after the addition of meta-wrappers, i.e. in the latest version of snakemake-wrappers.

Describe the bug
There is no code displayed anymore in any wrapper's documentation. The code is still displayed in meta-wrappers.

Minimal example
See missing documentation in:
https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/bedtools/slop.html#code

See available documentation in:
https://snakemake-wrappers.readthedocs.io/en/stable/meta-wrappers/star_arriba.html#code

Building documentation failed: No module named 'sphinx_copybutton'

Snakemake version
Snakemake 5.26.1 and snakemake-wrappers 0.66.0
Describe the bug
Building the documentation fails with an extension error: No module named 'sphinx_copybutton'.
Logs

Extension error:
Could not import extension sphinx_copybutton (exception: No module named 'sphinx_copybutton')
Makefile:20: recipe for target 'html' failed
make: *** [html] Error 2

Minimal example

conda create -n test-snakemake-wrapper-docs sphinx sphinx_rtd_theme pyyaml
conda activate test-snakemake-wrapper-docs
cd docs
make html

Additional context
The commands were executed within the test-snakemake-wrapper-docs environment, as documented in the "Testing locally" section here.

Support threads directive with fastqc

Is your feature request related to a problem? Please describe.

I was wondering why the fastqc wrapper does not use the threads directive, unlike other tools such as bwa mem.

Describe the solution you'd like

Could the wrapper be modified to use the threads directive?

Describe alternatives you've considered

Right now I'm using the params directive with -t 8 alongside threads: 8 for it to work on my HPC infrastructure, but I would like to avoid hardcoding the number of threads; see the sketch below.
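In the meantime, the hardcoding can be avoided by deriving the flag from the directive, since params functions may take threads as an argument. A sketch of that version of the workaround (assuming, as in the workaround itself, that the wrapper passes params through to the fastqc command line):

rule fastqc:
    input:
        "reads/{sample}.fastq.gz"
    output:
        html="qc/fastqc/{sample}.html",
        zip="qc/fastqc/{sample}_fastqc.zip"
    params:
        # stays in sync with the threads directive below
        lambda wildcards, threads: "-t {}".format(threads)
    threads: 8
    wrapper:
        "0.64.0/bio/fastqc"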

Only first of samples names is passed in dada2 make-table wrapper

Snakemake version
current snakemake-wrappers version 0.67.0 (bbd7c0f)

Describe the bug
The rownames of the resulting sequence table should be c("a","b") but are c("a",NA) instead.

Minimal example
The minimal example for reproducing the bug is in the test directory of the wrapper.

Wrappers using unnamed params break cluster params

Some wrappers, e.g. samtools sort or samtools index, consume params entirely to customize the tool invocation. Others, like trim_galore-pe, use a named param called extra.

The latter allows specifying additional params for cluster execution, as suggested in the documentation; the former breaks when additional named params are used. I'm not sure whether there are use cases for additional params other than --cluster, as they are obviously not going to be used in the rule code itself. An illustration follows below.
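To illustrate the pattern that keeps working: with a named extra param, additional keys can ride along that only the cluster submission command reads, e.g. --cluster "sbatch -p {params.partition}". A sketch (the partition key is hypothetical, and the output names follow the wrapper's test case as far as I recall):

rule trim_galore_pe:
    input:
        ["reads/{sample}.1.fastq.gz", "reads/{sample}.2.fastq.gz"]
    output:
        "trimmed/{sample}.1_val_1.fq.gz",
        "trimmed/{sample}.1.fastq.gz_trimming_report.txt",
        "trimmed/{sample}.2_val_2.fq.gz",
        "trimmed/{sample}.2.fastq.gz_trimming_report.txt"
    params:
        extra="--illumina -q 20",   # consumed by the wrapper
        partition="long"            # hypothetical: read only by the --cluster command
    log:
        "logs/trim_galore/{sample}.log"
    wrapper:
        "0.67.0/bio/trim_galore/pe"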

cutadapt no such option -j

I'm using snakemake version 5.19.2 and cutadapt wrapper version 0.60.0 (cutadapt version 1.9.1).

My cutadapt rule, which still worked a couple of months ago, now fails. From the log file:
cutadapt: error: no such option: -j (the threads option)

rule in *.smk file:

rule cutadapt:
    input:
        "fastq/raw/{unit}.R1.fastq.gz"
    output:
        fastq=temp("fastq/trimmed/{unit}.R1.fastq.gz"),
        qc="qc/cutadapt/{unit}.txt"
    params:
        cutadapt_extra
    log:
        "logs/cutadapt/{unit}.log"
    wrapper:
        "0.60.0/bio/cutadapt/se"

Configfile:

    cutadapt:
        extra:
            - "-a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"

FastQC wrapper sometimes get basename wrong

The logic for deriving the basename (and thus the output filenames) from the input doesn't quite match what FastQC does. For example:

Input file             | Wrapper expects      | FastQC produces
sample.fastq.gz        | sample_fastqc.html   | sample_fastqc.html
sample.fastqsanger.gz  | sample_fastqc.html   | sample.fastqsanger_fastqc.html

I think this file has the logic used: https://github.com/s-andrews/FastQC/blob/master/uk/ac/babraham/FastQC/Utilities/CasavaBasename.java

Thus something like:

split_ind = 2 if base.endswith(".fastq.gz") else 1

will work.

Conda fails to build remote env for custom wrapper. Works locally.

I've made a custom wrapper in a local repo. When I use the wrapper locally, the environment is built just fine and everything runs as expected. However, when I try to call it through GitHub, conda fails to build the environment from environment.yaml. For reference, I've been able to run the test cases of other wrappers without any issues.

I've tried modifying the environment file several different ways:

  • removing anything cuda related
  • removing the rapidsai channel and packages
  • removing the python version specification
  • removing the python dependency
  • removing all pip packages
  • modifying yaml tab spacing
  • removing the name argument
  • combinations of the above

I've even tried using an absolute path for the wrapper as opposed to specifying a prefix, but nothing seems to work. Here's an example that replicates the error.

snakemake version: 5.26.1

Here's the Snakefile:

configfile: "config.yaml"

MODELS = config["NNI_MODEL_PARAMS"].keys()

rule optimize_hypperparameters:
    output:
        experiment   = expand("nni/{model}_experiments.yaml", 
                               model = MODELS),
                                
        search_space = expand("nni/search_space/{model}_search_space.json",
                               model = MODELS),
                               
        experiment_results = expand("nni/experiment_results/{model}.csv", 
                                    model = MODELS)
    wrapper:
        "https://github.com/DamLabResources/pipeline-multitool/tree/main/nni_automation"

Here's the output log:

Building DAG of jobs...
Creating conda environment https:/github.com/DamLabResources/pipeline-multitool/tree/main/nni_automation/environment.yaml...
Downloading and installing remote packages.
CreateCondaEnvironmentException:
Could not create conda environment from /tmp/tmpc5g0j_1m.yaml:

# >>>>>>>>>>>>>>>>>>>>>> ERROR REPORT <<<<<<<<<<<<<<<<<<<<<<

    Traceback (most recent call last):
      File "/home/robertlink/anaconda3/lib/python3.7/site-packages/conda/exceptions.py", line 1079, in __call__
        return func(*args, **kwargs)
      File "/home/robertlink/anaconda3/lib/python3.7/site-packages/conda_env/cli/main.py", line 80, in do_call
        exit_code = getattr(module, func_name)(args, parser)
      File "/home/robertlink/anaconda3/lib/python3.7/site-packages/conda_env/cli/main_create.py", line 86, in execute
        spec = specs.detect(name=name, filename=filename, directory=os.getcwd())
      File "/home/robertlink/anaconda3/lib/python3.7/site-packages/conda_env/specs/__init__.py", line 43, in detect
        if spec.can_handle():
      File "/home/robertlink/anaconda3/lib/python3.7/site-packages/conda_env/specs/yaml_file.py", line 18, in can_handle
        self._environment = env.from_file(self.filename)
      File "/home/robertlink/anaconda3/lib/python3.7/site-packages/conda_env/env.py", line 157, in from_file
        return from_yaml(yamlstr, filename=filename)
      File "/home/robertlink/anaconda3/lib/python3.7/site-packages/conda_env/env.py", line 138, in from_yaml
        data = yaml_safe_load(yamlstr)
      File "/home/robertlink/anaconda3/lib/python3.7/site-packages/conda/common/serialize.py", line 67, in yaml_safe_load
        return yaml.safe_load(string, version="1.2")
      File "/home/robertlink/anaconda3/lib/python3.7/site-packages/ruamel_yaml/main.py", line 980, in safe_load
        return load(stream, SafeLoader, version)
      File "/home/robertlink/anaconda3/lib/python3.7/site-packages/ruamel_yaml/main.py", line 935, in load
        return loader._constructor.get_single_data()
      File "/home/robertlink/anaconda3/lib/python3.7/site-packages/ruamel_yaml/constructor.py", line 109, in get_single_data
        node = self.composer.get_single_node()
      File "/home/robertlink/anaconda3/lib/python3.7/site-packages/ruamel_yaml/composer.py", line 78, in get_single_node
        document = self.compose_document()
      File "/home/robertlink/anaconda3/lib/python3.7/site-packages/ruamel_yaml/composer.py", line 104, in compose_document
        self.parser.get_event()
      File "/home/robertlink/anaconda3/lib/python3.7/site-packages/ruamel_yaml/parser.py", line 163, in get_event
        self.current_event = self.state()
      File "/home/robertlink/anaconda3/lib/python3.7/site-packages/ruamel_yaml/parser.py", line 239, in parse_document_end
        token = self.scanner.peek_token()
      File "/home/robertlink/anaconda3/lib/python3.7/site-packages/ruamel_yaml/scanner.py", line 182, in peek_token
        self.fetch_more_tokens()
      File "/home/robertlink/anaconda3/lib/python3.7/site-packages/ruamel_yaml/scanner.py", line 282, in fetch_more_tokens
        return self.fetch_value()
      File "/home/robertlink/anaconda3/lib/python3.7/site-packages/ruamel_yaml/scanner.py", line 655, in fetch_value
        self.reader.get_mark(),
    ruamel_yaml.scanner.ScannerError: mapping values are not allowed here
      in "<unicode string>", line 148, column 34:
            <span style="background-color: #79b8ff;width: 0%;" class="Pro ... 
                                         ^ (line: 148)

`$ /home/robertlink/anaconda3/bin/conda-env create --quiet --file /home/robertlink/nni_wrapper_test/.snakemake/conda/2002495d.yaml --prefix /home/robertlink/nni_wrapper_test/.snakemake/conda/2002495d`

  environment variables:
                 CIO_TEST=<not set>
                 CNI_PATH=/home/robertlink/anaconda3/lib/cni
  CONDA_AUTO_UPDATE_CONDA=false
        CONDA_DEFAULT_ENV=base
                CONDA_EXE=/home/robertlink/anaconda3/bin/conda
CONDA_MKL_INTERFACE_LAYER_BACKUP=
             CONDA_PREFIX=/home/robertlink/anaconda3
    CONDA_PROMPT_MODIFIER=(base)
         CONDA_PYTHON_EXE=/home/robertlink/anaconda3/bin/python
               CONDA_ROOT=/home/robertlink/anaconda3
              CONDA_SHLVL=1
                CUDA_PATH=/home/robertlink/anaconda3
           CURL_CA_BUNDLE=<not set>
          LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64:/usr/local/cuda/extras/CUPTI/lib64
                     PATH=/usr/local/cuda-10.2/bin:/home/robertlink/edirect:/home/robertlink/CNN
                          _promoter_prediction/Jurkat_CNN_transcriptome/Jurkat_RNA_Seq_Processin
                          g/Dart:/home/robertlink/sratoolkit.2.8.2-1-ubuntu64/bin:/home/robertli
                          nk/cd-hit-v4.8.1-2019-0228:/usr/local/cuda-10.0/bin:/home/robertlink/b
                          in:/home/robertlink/.local/bin:/usr/local/cuda-10.2/bin:/home/robertli
                          nk/anaconda3/bin:/home/robertlink/anaconda3/condabin:/home/robertlink/
                          edirect:/home/robertlink/CNN_promoter_prediction/Jurkat_CNN_transcript
                          ome/Jurkat_RNA_Seq_Processing/Dart:/home/robertlink/sratoolkit.2.8.2-1
                          -ubuntu64/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:
                          /bin:/usr/games:/usr/local/games:/snap/bin
               PYTHONPATH=/home/robertlink/custom_python_modules:/home/robertlink/custom_python_
                          modules:
       REQUESTS_CA_BUNDLE=<not set>
            SSL_CERT_FILE=<not set>

     active environment : base
    active env location : /home/robertlink/anaconda3
            shell level : 1
       user config file : /home/robertlink/.condarc
 populated config files : 
          conda version : 4.8.5
    conda-build version : 3.18.11
         python version : 3.7.8.final.0
       virtual packages : __cuda=10.2
                          __glibc=2.27
       base environment : /home/robertlink/anaconda3  (writable)
           channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/r/linux-64
                          https://repo.anaconda.com/pkgs/r/noarch
          package cache : /home/robertlink/anaconda3/pkgs
                          /home/robertlink/.conda/pkgs
       envs directories : /home/robertlink/anaconda3/envs
                          /home/robertlink/.conda/envs
               platform : linux-64
             user-agent : conda/4.8.5 requests/2.24.0 CPython/3.7.8 Linux/5.4.0-47-generic ubuntu/18.04.4 glibc/2.27
                UID:GID : 1002:1002
             netrc file : None
           offline mode : False


An unexpected error has occurred. Conda has prepared the above report.


  File "/home/robertlink/anaconda3/lib/python3.7/site-packages/snakemake/deployment/conda.py", line 352, in create

Any thoughts or help would be very appreciated. Thanks!
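One observation: the ScannerError above chokes on HTML (<span style="background-color: ...">) inside the downloaded environment.yaml, which suggests the github.com/.../tree/... URL serves GitHub's HTML page rather than the raw files. Pointing the wrapper at the raw-content host might behave differently; a guess rather than a confirmed fix (branch assumed to be main):

    wrapper:
        "https://raw.githubusercontent.com/DamLabResources/pipeline-multitool/main/nni_automation"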

Error executing rule get_vep_cache on cluster

Snakemake version

5.23.0

Describe the bug

I'm trying to get Snakemake running on another server we have access to (this one has an LSF scheduler). I used the VEP cache wrapper example verbatim as a minimal example.

Logs

Taking a look at /gpfs/home/moldach/projects/saliva/SIMPLE_TEST/.snakemake/log/2020-08-27T163012.417959.snakemake.log:

Workflow defines that rule get_vep_cache is eligible for caching between workflows (use the --cache argument to enable this).
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cluster nodes: 1
Job counts:
        count   jobs
        1       all
        1       get_vep_cache
        2

[Thu Aug 27 16:30:13 2020]
rule get_vep_cache:
    output: resources/vep/cache
    log: logs/vep/cache.log
    jobid: 1

Submitted job 1 with external jobid '89368 logs/cluster/get_vep_cache/unique/jobid1_089efa0e-a6ea-47aa-9c02-c1ce6ad92cbe.out'.
[Thu Aug 27 16:30:33 2020]
Error in rule get_vep_cache:
    jobid: 1
    output: resources/vep/cache
    log: logs/vep/cache.log (check log file(s) for error message)
    conda-env: /gpfs/home/moldach/projects/saliva/SIMPLE_TEST/.snakemake/conda/60d0d409
    cluster_jobid: 89368 logs/cluster/get_vep_cache/unique/jobid1_089efa0e-a6ea-47aa-9c02-c1ce6ad92cbe.out

Error executing rule get_vep_cache on cluster (jobid: 1, external: 89368 logs/cluster/get_vep_cache/unique/jobid1_089efa0e-a6ea-47aa-9c02-c1ce6ad92cbe.out, jobscript: /gpfs/home/moldach/projects/saliva/SIMPLE_TEST/.snakemake/tmp.a24u6iry/snakejob.get_vep_cache.1.sh). For error details see the cluster log and the log files of the involved rule(s).
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /gpfs/home/moldach/projects/saliva/SIMPLE_TEST/.snakemake/log/2020-08-27T163012.417959.snakemake.log

Minimal example

The Snakefile:

rule all:
    input:
        "resources/vep/cache"

rule get_vep_cache:
    output:
        directory("resources/vep/cache")
    params:
        species="saccharomyces_cerevisiae",
        build="R64-1-1",
        release="98"
    log:
        "logs/vep/cache.log"
    cache: True
    wrapper:
        "0.64.0/bio/vep/cache"

The submit script:

bash -c "nohup snakemake --profile lsf --use-conda --jobs 1 &"

Additional context

$ cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

Format with snakefmt

Would it be OK for me to create a PR where I add snakefmt to the CI checks and also format all current Snakefiles with it?

Options for wrappers

I'm struggling with editing snakemake/wrappers.py because I don't understand all the options. The default seems to be to specify the subfolder in the GitHub repository here, but it wouldn't make sense for the testing files to use that (unless they are merged to master), because we would then always be testing the master branch. I was testing a local wrapper with file://../wrapper.py (up one folder from the test Snakefile), but that doesn't seem to work. It looks like a wrapper could also be specified as http:, file:, or git+file:, and I'm hoping someone can help me clearly lay out how this is supposed to work so I can do the implementation (the current one doesn't work / make sense to me). Thanks!

Warning when building docs YAMLLoadWarning

Snakemake version
Snakemake-wrappers version 0.67.0 and Sphinx v3.2.1

Describe the bug
A warning is issued by Sphinx when building the docs, either locally or remotely with GitHub Actions.

Logs

/home/runner/work/snakemake-wrappers/snakemake-wrappers/docs/generate_docs.py:72: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  env = yaml.load(env)

Can be found here as well.

Additional context
https://github.com/yaml/pyyaml/wiki/PyYAML-yaml.load(input)-Deprecation#how-to-disable-the-warning
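Per the link above, the fix amounts to one line in docs/generate_docs.py: pass an explicit loader, or use safe_load, its documented equivalent. A sketch of the minimal change:

import yaml

# generate_docs.py, line 72: replaces the bare yaml.load(env)
env = yaml.safe_load(env)  # same as yaml.load(env, Loader=yaml.SafeLoader)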

This is not a priority; I just stumbled upon it several times and decided to register it. I will probably PR a minor fix soon.
Best,
Charlie

FastQ_Screen Snakemake Wrapper Error

Hi,

Currently the FastQ_Screen wrapper derives the output prefix by removing only the '.fastq' extension from the input file.

However, as shown below, fastq_screen also strips other extensions ('.seq', '.txt' and '.fq').

Line 358 of fastq_screen:

  • $outfile =~ s/\.(txt|seq|fastq|fq)$//i;

This means the wrapper will fail for FASTQ files ending in .fq, .txt or .seq as there will be a filename mismatch between the actual fastq_screen output and the anticipated output based on the prefix.

Thanks,
Stephen

Need to refactor parameter names

Snakemake version
5.8.1+12.g85a09c4

Describe the bug
Several wrappers use params whose names collide with Namedlist attributes. Originally reported here.

Additional context
This is the result of making the functions in the Namedlist class read-only. Any params with conflicting names (e.g. index, sort) need to be renamed to function properly.

number of threads in bwa mem wrapper

Hi, apologies for this very basic question, but I am wondering how Snakemake deals with the threads parameter in the bwa mem wrapper. The wrapper uses this parameter to specify the number of threads used by bwa mem. But when a sort is also applied, the output is piped to either samtools or picard, and this, I assume, requires at least one additional thread.

So my question is whether this causes a mismatch between the number of threads anticipated by Snakemake (i.e. the threads parameter) and the actual threads used (i.e. the threads parameter + 1), and whether this could have unwanted effects.

Thanks
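For context, one way a wrapper can avoid oversubscription is to carve the sort step's thread out of the declared budget. A hedged sketch of that idea, not necessarily what the bwa mem wrapper actually does (the snakemake object is injected into wrapper scripts at runtime):

from snakemake.shell import shell

# Reserve one thread for samtools sort when more than one is available.
bwa_threads = max(1, snakemake.threads - 1)
shell(
    "bwa mem -t {bwa_threads} {snakemake.params.index} {snakemake.input.reads}"
    " | samtools sort -o {snakemake.output[0]} -"
)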

"stable" tag on readthedocs does not display the "Code" blocks

Snakemake version
Stable release of wrappers - 0.66.0

Describe the bug
When viewing the documentation on readthedocs, the stable tag doesn't show the code blocks. Other version tags and the latest tag do.

Logs
Did not try building the docs, so I'm not sure if there are useful log messages.


utility functions for general-purpose wrapper setup (e.g. JVM memory option handling)

Is your feature request related to a problem? Please describe.
@tdayris is currently working on unifying the handling of Java memory specifications in wrappers where the bioconda recipes allow for it. This leads to rather large blocks of code repeated across all the respective wrappers, and whenever we want to change the memory-handling strategy, we will have to edit all those places in sync. See here:
#204 (comment)

Describe the solution you'd like
I would like to have some kind of common or utils module that provides useful functions. In the case above, it would e.g. take the snakemake object and a java_opts string as parameters and adjust the java_opts string with the Java memory option accordingly.
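A minimal sketch of such a helper, with hypothetical names and the assumption that memory is declared via resources.mem_mb:

def get_java_opts(snakemake, java_opts=""):
    """Append a -Xmx option derived from the rule's resources, unless already set."""
    mem_mb = snakemake.resources.get("mem_mb")
    if mem_mb and "-Xmx" not in java_opts:
        java_opts += " -Xmx{}m".format(mem_mb)
    return java_opts.strip()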

Describe alternatives you've considered
The alternative is keeping all those copies of the code around. This might make the wrappers more readable, as no reference to another code file is needed, but as described above, it incurs the maintenance burden of keeping all those code blocks in sync.

A number of parameters are not working for the Trimmomatic PE wrapper

Previously my Trimmomatic shell command included the following trimmers:

ILLUMINACLIP:adapters.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

However, only LEADING:3 and MINLEN:36 work in the trimmer param.

rule trimming:
    input:
        r1 = lambda wildcards: getHome(wildcards.sample)[0],
        r2 = lambda wildcards: getHome(wildcards.sample)[1]
    output:
        r1 = os.path.join(dirs_dict["TRIM_DIR"],config["TRIM_TOOL"],"{sample}_R1_trim_paired.fastq.gz"),
        r1_unpaired = os.path.join(dirs_dict["TRIM_DIR"],config["TRIM_TOOL"],"{sample}_R1_trim_unpaired.fastq.gz"),
        r2 = os.path.join(dirs_dict["TRIM_DIR"],config["TRIM_TOOL"],"{sample}_R2_trim_paired.fastq.gz"),
        r2_unpaired = os.path.join(dirs_dict["TRIM_DIR"],config["TRIM_TOOL"],"{sample}_R2_trim_unpaired.fastq.gz")
    log: os.path.join(dirs_dict["LOG_DIR"],config["TRIM_TOOL"],"{sample}.log")
    threads: 32
    params:
        # list of trimmers (see manual)
        trimmer=["LEADING:3", "MINLEN:36"],
        # optional parameters
        extra="",
        compression_level="-9"
    resources:
        mem = 1000,
        time = 120
    message: """--- Trimming FASTQ files with Trimmomatic."""
    wrapper:
        "0.64.0/bio/trimmomatic/pe"

When trying to use any of the other parameters (ILLUMINACLIP:adapters.fa:2:30:10, TRAILING:3, SLIDINGWINDOW:4:15), it fails.

For example, trying only TRAILING:3:

rule trimming:
    input:
        r1 = lambda wildcards: getHome(wildcards.sample)[0],
        r2 = lambda wildcards: getHome(wildcards.sample)[1]
    output:
        r1 = os.path.join(dirs_dict["TRIM_DIR"],config["TRIM_TOOL"],"{sample}_R1_trim_paired.fastq.gz"),
        r1_unpaired = os.path.join(dirs_dict["TRIM_DIR"],config["TRIM_TOOL"],"{sample}_R1_trim_unpaired.fastq.gz"),
        r2 = os.path.join(dirs_dict["TRIM_DIR"],config["TRIM_TOOL"],"{sample}_R2_trim_paired.fastq.gz"),
        r2_unpaired = os.path.join(dirs_dict["TRIM_DIR"],config["TRIM_TOOL"],"{sample}_R2_trim_unpaired.fastq.gz")
    log: os.path.join(dirs_dict["LOG_DIR"],config["TRIM_TOOL"],"{sample}.log")
    threads: 32
    params:
        # list of trimmers (see manual)
        trimmer=["TRAILING:3"],
        # optional parameters
        extra="",
        compression_level="-9"
    resources:
        mem = 1000,
        time = 120
    message: """--- Trimming FASTQ files with Trimmomatic."""
    wrapper:
        "0.64.0/bio/trimmomatic/pe"

Results in the following error:

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job counts:
	count   jobs
        1	qc_before_align_r1
        1

[Mon Sep 14 13:42:08 2020]
Job 0: --- Quality check of raw data with FastQC before alignment.

Activating conda environment: /home/moldach/wrappers/.snakemake/conda/975fb1fd
Activating conda environment: /home/moldach/wrappers/.snakemake/conda/975fb1fd
Skipping ' 2> logs/fastqc/before_align/MTG324_R1.log' which didn't exist, or couldn't be read
Failed to process file MTG324_R1_trim_paired.fastq.gz
uk.ac.babraham.FastQC.Sequence.SequenceFormatException: Ran out of data in the middle of a fastq entry.  Your file is probably truncated
        at uk.ac.babraham.FastQC.Sequence.FastQFile.readNext(FastQFile.java:179)
        at uk.ac.babraham.FastQC.Sequence.FastQFile.next(FastQFile.java:125)
        at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:77)
        at java.base/java.lang.Thread.run(Thread.java:834)
mv: cannot stat ‘/tmp/tmpsnncjthh/MTG324_R1_trim_paired_fastqc.html’: No such file or directory
Traceback (most recent call last):
  File "/home/moldach/wrappers/.snakemake/scripts/tmpp34b98yj.wrapper.py", line 47, in <module>
    shell("mv {html_path:q} {snakemake.output.html:q}")
  File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/shell.py", line 205, in __new__
    raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'set -euo pipefail;  mv /tmp/tmpsnncjthh/MTG324_R1_trim_paired_fastqc.html qc/fastQC/before_align/MTG324_R1_trim_paired_fastqc.html' returned non-zero exit status $
[Mon Sep 14 13:45:16 2020]
Error in rule qc_before_align_r1:
    jobid: 0
    output: qc/fastQC/before_align/MTG324_R1_trim_paired_fastqc.html, qc/fastQC/before_align/MTG324_R1_trim_paired_fastqc.zip
    log: logs/fastqc/before_align/MTG324_R1.log (check log file(s) for error message)
    conda-env: /home/moldach/wrappers/.snakemake/conda/975fb1fd

RuleException:
CalledProcessError in line 181 of /home/moldach/wrappers/Trim:
Command 'source /home/moldach/anaconda3/bin/activate '/home/moldach/wrappers/.snakemake/conda/975fb1fd'; set -euo pipefail;  python /home/moldach/wrappers/.snakemake/scripts/tmpp34b98yj.wrapper.py' retu$
  File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 2189, in run_wrapper
  File "/home/moldach/wrappers/Trim", line 181, in __rule_qc_before_align_r1
  File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 529, in _callback
  File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/concurrent/futures/thread.py", line 57, in run
  File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 515, in cached_or_run
  File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 2201, in run_wrapper
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

bwa index and mem potential path bugs

Snakemake version
5.10.0

Describe the bug
For both bwa index and bwa mem, the user might put in a prefix that works for a local job but fails on a remote. Examples are provided below:

Bwa Mem

For bwa mem, note that params has an "index" entry that is used to specify the path. I don't think the first example below will work:

rule map_reads:
    input:
        reads=get_merged,
        idx=rules.bwa_index.output
    output:
        temp("mapped/{sample}.sorted.bam")
    log:
        "logs/bwa_mem/{sample}.log"
    params:
        index="refs/genome",
        extra=get_read_group,
        sort="samtools",
        sort_order="coordinate"
    threads: 8
    wrapper:
        "0.39.0/bio/bwa/mem"

But this would:

rule map_reads:
    input:
        reads=get_merged,
        idx=rules.bwa_index.output
    output:
        temp("mapped/{sample}.sorted.bam")
    log:
        "logs/bwa_mem/{sample}.log"
    params:
        index="snakemake-testing/kim-wxs-varlociraptor/refs/genome",
        extra=get_read_group,
        sort="samtools",
        sort_order="coordinate"
    threads: 8
    wrapper:
        "0.39.0/bio/bwa/mem"

Unlike bwa index (discussed next) I'm not sure we can just remove this one.

Bwa Index

For bwa index, given a run on a remote with this recipe:

rule bwa_index:
    input:
        "refs/genome.fasta"
    output:
        multiext("refs/genome", ".amb", ".ann", ".bwt", ".pac", ".sa")
    params:
        prefix="refs/genome"
    log:
        "logs/bwa_index.log"
    resources:
        mem_mb=6000,disk_mb=128000
    benchmark:
        "benchmarks/bwa_index.tsv"
    wrapper:
        "0.45.1/bio/bwa/index"

The error log will report that bwa/genome.pac cannot be found. It's not the inputs or outputs; rather, the prefix is used to determine the path to the file! The correct usage would be:

rule bwa_index:
    input:
        "refs/genome.fasta"
    output:
        multiext("refs/genome", ".amb", ".ann", ".bwt", ".pac", ".sa")
    params:
        prefix="snakemake-testing/kim-wxs-varlociraptor/refs/genome"
    log:
        "logs/bwa_index.log"
    resources:
        mem_mb=6000,disk_mb=128000
    benchmark:
        "benchmarks/bwa_index.tsv"
    wrapper:
        "0.45.1/bio/bwa/index"

But since we aren't required to specify it, it would be even better to remove it entirely:

rule bwa_index:
    input:
        "refs/genome.fasta"
    output:
        multiext("refs/genome", ".amb", ".ann", ".bwt", ".pac", ".sa")
    log:
        "logs/bwa_index.log"
    resources:
        mem_mb=6000,disk_mb=128000
    benchmark:
        "benchmarks/bwa_index.tsv"
    wrapper:
        "0.45.1/bio/bwa/index"

Of course the user writing the pipeline might not know this, in which case maybe there should be a fix so that a specified prefix, given a default remote prefix, gets the remote prefix added as well?

For both of the above, if I can get started on work to fix the wrappers and then open a PR here, I'd be happy to do that! I'm not sure whether there are other cases like this in the wrappers. Let me know your thoughts.
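For the bwa index case, a sketch of deriving the prefix inside the wrapper instead of requiring params.prefix (mirroring the suggestion above; the snakemake object is injected into wrapper scripts at runtime):

import os

# "refs/genome.amb" -> "refs/genome"; derived from the declared outputs,
# so any remote prefix added to the outputs carries over automatically.
prefix = os.path.splitext(snakemake.output[0])[0]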

Picard Createsequencedictionary

Hi,
the wrapper for PICARD CREATESEQUENCEDICTIONARY creates an unfinished *.dict file.
Do I need to pass any additional parameters to get a correct *.dict file?
Many thanks.
Best regards,
Raphael

gatk variantrecalibrator wrapper does not specify properly the resources path

Snakemake version
Snakemake v5.31.1
Wrapper 0.68.0/bio/gatk/variantrecalibrator

Describe the bug
For VariantRecalibrator, I use my own paths for the resource VCF files. When running the rule, I get an error from GATK that the resource does not exist. Checking the wrapper's code, there seems to be a syntax error in how the resource paths are constructed: for GATK v4.1.1, the ":" before the path in the resource parameters was removed, according to:
https://gatk.broadinstitute.org/hc/en-us/community/posts/360072126112-Variant-Recalibrator-Couldn-t-read-file

Logs
Snakemake
RuleException:
CalledProcessError in line 52 of /home/VITO/correara/genomics/rules/filtering.smk:
Command 'source /home/VITO/correara/miniconda3/bin/activate '/home/VITO/correara/genomics/.snakemake/conda/d509627e'; set -euo pipefail; python /home/VITO/correara/genomics/.snakemake/scripts/tmpz9cha92t.wrapper.py' returned non-zero exit status 1.
File "/home/VITO/correara/miniconda3/envs/BioSnake/lib/python3.9/site-packages/snakemake/executors/init.py", line 2317, in run_wrapper
File "/home/VITO/correara/genomics/rules/filtering.smk", line 52, in __rule_recalibrate_calls
File "/home/VITO/correara/miniconda3/envs/BioSnake/lib/python3.9/site-packages/snakemake/executors/init.py", line 566, in _callback
File "/home/VITO/correara/miniconda3/envs/BioSnake/lib/python3.9/concurrent/futures/thread.py", line 52, in run
File "/home/VITO/correara/miniconda3/envs/BioSnake/lib/python3.9/site-packages/snakemake/executors/init.py", line 552, in cached_or_run
File "/home/VITO/correara/miniconda3/envs/BioSnake/lib/python3.9/site-packages/snakemake/executors/init.py", line 2348, in run_wrapper

GATK
Using GATK jar /home/VITO/correara/genomics/.snakemake/conda/d509627e/share/gatk4-4.1.4.1-1/gatk-package-4.1.4.1-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/VITO/correara/genomics/.snakemake/conda/d509627e/share/gatk4-4.1.4.1-1/gatk-package-4.1.4.1-local.jar VariantRecalibrator --max-gaussians 4 --resource hapmap,known=false,training=true,truth=true,prior=15.0:/home/VITO/correara/genomics/hapmap/hapmap_3.3.hg38.vcf.gz --resource omni,known=false,training=true,truth=false,prior=12.0:/home/VITO/correara/genomics/omni/1000G_omni2.5.hg38.vcf.gz --resource g1k,known=false,training=true,truth=false,prior=10.0:/home/VITO/correara/genomics/g1k/1000G_phase1.snps.high_confidence.hg38.vcf.gz --resource dbsnp,known=true,training=false,truth=false,prior=2.0:/home/VITO/correara/genomics/dbsnp/hg38_dbsnp138.vcf.gz -R resources/hg38/hg38.fa -V filtered/ERR032031.indels.vcf.gz -mode INDEL --output filtered/ERR032031.indels.recalibrated.vcf.gz --tranches-file filtered/ERR032031.indels.tranches -an QD -an FS -an ReadPosRankSum -an MQRankSum -an SOR -an DP
13:58:07.143 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/VITO/correara/genomics/.snakemake/conda/d509627e/share/gatk4-4.1.4.1-1/gatk-package-4.1.4.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
Jan 09, 2021 1:58:07 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
13:58:07.421 INFO VariantRecalibrator - ------------------------------------------------------------
13:58:07.421 INFO VariantRecalibrator - The Genome Analysis Toolkit (GATK) v4.1.4.1
13:58:07.422 INFO VariantRecalibrator - For support and documentation go to https://software.broadinstitute.org/gatk/
13:58:07.422 INFO VariantRecalibrator - Executing as correara@dev01 on Linux v4.4.0-198-generic amd64
13:58:07.422 INFO VariantRecalibrator - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_265-b11
13:58:07.422 INFO VariantRecalibrator - Start Date/Time: January 9, 2021 1:58:07 PM CET
13:58:07.422 INFO VariantRecalibrator - ------------------------------------------------------------
13:58:07.422 INFO VariantRecalibrator - ------------------------------------------------------------
13:58:07.422 INFO VariantRecalibrator - HTSJDK Version: 2.21.0
13:58:07.422 INFO VariantRecalibrator - Picard Version: 2.21.2
13:58:07.422 INFO VariantRecalibrator - HTSJDK Defaults.COMPRESSION_LEVEL : 2
13:58:07.422 INFO VariantRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
13:58:07.422 INFO VariantRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
13:58:07.423 INFO VariantRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
13:58:07.423 INFO VariantRecalibrator - Deflater: IntelDeflater
13:58:07.423 INFO VariantRecalibrator - Inflater: IntelInflater
13:58:07.423 INFO VariantRecalibrator - GCS max retries/reopens: 20
13:58:07.423 INFO VariantRecalibrator - Requester pays: disabled
13:58:07.423 INFO VariantRecalibrator - Initializing engine
13:58:08.124 INFO VariantRecalibrator - Shutting down engine
[January 9, 2021 1:58:08 PM CET] org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibrator done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=627572736


A USER ERROR has occurred: Couldn't read file file:///home/VITO/correara/genomics/hapmap,known=false,training=true,truth=true,prior=15.0:/home/VITO/correara/genomics/hapmap/hapmap_3.3.hg38.vcf.gz. Error was: It doesn't exist.


Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.

Minimal example

rule recalibrate_calls:
      input:
          vcf = "filtered/{sample}.{vartype}.vcf.gz",
          ref = "resources/hg38/hg38.fa",
          hapmap = "/home/VITO/correara/genomics/hapmap/hapmap_3.3.hg38.vcf.gz",
          omni = "/home/VITO/correara/genomics/omni/1000G_omni2.5.hg38.vcf.gz",
          g1k = "/home/VITO/correara/genomics/g1k/1000G_phase1.snps.high_confidence.hg38.vcf.gz",
          dbsnp = "/home/VITO/correara/genomics/dbsnp/hg38_dbsnp138.vcf.gz",
      output:
          vcf = temp("filtered/{sample}.{vartype}.recalibrated.vcf.gz"),
          tranches = "filtered/{sample}.{vartype}.tranches",
          rscript = "filtered/{sample}.{vartype}.recal.plots.R"
      params:
          mode = get_mode,
          resources = {"hapmap": {"known": False, "training": True, "truth": True, "prior": 15.0},
                       "omni":   {"known": False, "training": True, "truth": False, "prior": 12.0},
                       "g1k":    {"known": False, "training": True, "truth": False, "prior": 10.0},
                       "dbsnp":  {"known": True, "training": False, "truth": False, "prior": 2.0}},
          annotation = ["QD", "FS", "ReadPosRankSum", "MQRankSum", "SOR", "DP"],
          extra = get_gaussians
      log:
          "logs/gatk.{sample}.{vartype}.variantrecalibrator.log"
      wrapper:
          "0.68.0/bio/gatk/variantrecalibrator"

Would it be possible to fix the wrapper to use the updated GATK syntax?
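
For reference, a minimal sketch of how the wrapper could emit the updated resource flags (an illustration only, not the wrapper's actual code: format_resources is a hypothetical helper, and paths are looked up from identically named inputs; GATK 4.1+ expects --resource:NAME,tags PATH instead of the old --resource NAME,tags:PATH):

def format_resources(resources, inputs):
    # build GATK 4.1+ style resource flags from the params dict shown in the rule above
    flags = []
    for name, props in resources.items():
        tags = ",".join(
            "{}={}".format(k, str(v).lower() if isinstance(v, bool) else v)
            for k, v in props.items()
        )
        # the file path comes from the identically named input,
        # e.g. input: hapmap="/home/.../hapmap_3.3.hg38.vcf.gz"
        flags.append("--resource:{},{} {}".format(name, tags, getattr(inputs, name)))
    return " ".join(flags)

# format_resources(snakemake.params.resources, snakemake.input) would yield, e.g.:
# --resource:hapmap,known=false,training=true,truth=true,prior=15.0 /home/.../hapmap_3.3.hg38.vcf.gz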

bowtie2_build wrapper produces wrong output name for large index

Snakemake version
snakemake 5.30.1
wrapper:
"0.68.0/bio/bowtie2/build"

Describe the bug
When building a large index, the bowtie2_build wrapper appends an extra "l" in the middle of the index base name, which results in output like:
bowtie2/index/rep82l.1.bt2l SHOULD BE bowtie2/index/rep82.1.bt2l
bowtie2/index/rep82l.2.bt2l SHOULD BE bowtie2/index/rep82.2.bt2l
....
....
....
bowtie2/index/rep82l.rev.2.bt2l SHOULD BE bowtie2/index/rep82.rev.2.bt2l

which then results in a MissingOutputException, as the actual output differs from the expected one.

Logs
MissingOutputException in line 77 of /klaster/scratch/kkopera/COVID/microbiome/snakemake/QC_and_FT/Snakefile:
Job Missing files after 5 seconds:
bowtie2/index/rep82.1.bt2l
bowtie2/index/rep82.2.bt2l
bowtie2/index/rep82.3.bt2l
bowtie2/index/rep82.4.bt2l
bowtie2/index/rep82.rev.1.bt2l
bowtie2/index/rep82.rev.2.bt2l
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Job id: 4 completed successfully, but some output files are missing.

Minimal example

rule bowtie2_build:
    input:
        reference="prebuilt_data/rep82.fna"
    output:
        multiext(
            "bowtie2/index/rep82",
            ".1.bt2l", ".2.bt2l", ".3.bt2l", ".4.bt2l", ".rev.1.bt2l", ".rev.2.bt2l",
        ),
    log:
        "logs/bowtie2_build/build.log"
    params:
        extra="--large-index"  # optional parameters
    threads: 10
    wrapper:
        "0.68.0/bio/bowtie2/build"

Would it be possible to fix this bug? It currently prevents the wrapper from being used with large indexes.
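
For what it's worth, a sketch of a prefix inference that handles both the small- and large-index extensions (this assumes the cause is a naive strip of the .1.bt2 suffix, which leaves the trailing "l" of .bt2l behind):

import re

def infer_prefix(index_file):
    # strip ".1.bt2", ".rev.2.bt2", ".3.bt2l", etc. in one pass, so that
    # "bowtie2/index/rep82.1.bt2l" yields "bowtie2/index/rep82", not "rep82l"
    return re.sub(r"\.(rev\.)?\d\.bt2l?$", "", index_file)

assert infer_prefix("bowtie2/index/rep82.1.bt2l") == "bowtie2/index/rep82"
assert infer_prefix("bowtie2/index/rep82.rev.2.bt2") == "bowtie2/index/rep82"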

Removal of the sort pipe in bwa wrappers

Is your feature request related to a problem? Please describe.
Several bwa wrappers include a pipe to convert to bam or to sort:
bwa/mem
bwa/mem-samblaster
bwa/samse
bwa/sampe
bwa-mem2/mem
bwa-mem2/mem-samblaster

I believe this is not needed, does not make sense, and is sometimes inconvenient.

It is not needed, since the pipe can easily be implemented in Snakemake with other rules/wrappers. Since these are effectively wrappers of two wrappers, shouldn't they be meta-wrappers instead? And it is inconvenient when you actually want a SAM file (and not a BAM).

Describe the solution you'd like
Remove the optional sorting from all bwa wrappers and have them output a SAM file. Optionally, they could be re-implemented as meta-wrappers.
The same would apply to bwa/mem-samblaster and bwa-mem2/mem-samblaster.
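
As a rough illustration, the simplified wrapper body could reduce to a single bwa call writing SAM (a sketch only, following the usual params.get idiom; sorting and BAM conversion would be delegated to separate rules or wrappers):

from snakemake.shell import shell

index = snakemake.params.get("index", "")
extra = snakemake.params.get("extra", "")
# stdout carries the SAM stream, so only stderr goes to the log
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

shell(
    "bwa mem -t {snakemake.threads} {extra} {index} "
    "{snakemake.input.reads} > {snakemake.output[0]} {log}"
)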

markduplicates requires log file

Snakemake version
Snakemake - 5.6.0
Snakemake wrapper - 0.47.0

Describe the bug
The markduplicates wrapper uses {snakemake.log} directly in its shell command, instead of utilizing snakemake.log_fmt_shell. This necessitates a file in the log: directive, and its absence results in the job ending in an error.

Relevant code:

shell(
"picard MarkDuplicates {snakemake.params} INPUT={snakemake.input} "
"OUTPUT={snakemake.output.bam} METRICS_FILE={snakemake.output.metrics} "
"&> {snakemake.log}"
)
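
A minimal sketch of the fix, using log_fmt_shell (the idiom most other wrappers use), so that a missing log: directive no longer yields a broken shell command:

from snakemake.shell import shell

# evaluates to an empty string when no log: directive is given
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "picard MarkDuplicates {snakemake.params} INPUT={snakemake.input} "
    "OUTPUT={snakemake.output.bam} METRICS_FILE={snakemake.output.metrics} "
    "{log}"
)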

Logs

/bin/bash: -c: line 0: syntax error near unexpected token `newline'
/bin/bash: -c: line 0: `set -euo pipefail;  picard MarkDuplicates REMOVE_DUPLICATES=true  TMP_DIR=/tmp INPUT=a.sorted.bam OUTPUT=dedup/a.bam METRICS_FILE=dedup/a.metrics.txt &>'
Traceback (most recent call last):
  File "/up/.snakemake/scripts/tmp075fw1hj.wrapper.py", line 14, in <module>
    shell("picard MarkDuplicates {snakemake.params} INPUT={snakemake.input} "
  File "/mnt/snakemake/snakemake/shell.py", line 149, in __new__
    raise sp.CalledProcessError(retcode, cmd)

Minimal example

rule mark_duplicates:
    input:
        "a.sorted.bam"
    output:
        bam="dedup/a.bam",
        metrics="dedup/a.metrics.txt"
    log:
        "/logs/a.log"
    wrapper:
        "0.47.0/bio/picard/markduplicates"


Association Testing Plink 1.90

Hello,

I'd like to contribute adding this tool.

  • Are there any relevant docs I should read? Otherwise, do you mind pointing me to relevant source code to read and adapt? I'll just go ahead with a fork and try to adapt one of the available wrappers.

I've already made my own pipeline with 3 rules: it starts with a target-sequencing VCF file and builds all the necessary input files for plink from it plus any column of a CSV file with a continuous variable (preprocessing). With these I run the association testing (with multiple-testing correction by permutation) and finally output a nice HTML report (using R Markdown) with the "tsv" tables that Plink outputs.

  • Can I make a wrapper that is, in reality, 3 rules?

Thanks in advance!

SnpEff - (unnecessary) Multiple BCFTools calls

Hi,

Snakemake version
Snakemake: bioconda:snakemake=5.10.0 (latest)

Snakemake-wrappers: 0.49.0 (latest)
However, the issue has been present since commit d90abc3, which modified the file wrapper.py of SnpEff.

Describe the bug

In the file bio/snpeff/wrapper.py, lines 22 to 24 overwrite the value of the variable incalls:

incalls = snakemake.input[0]
if incalls.endswith(".bcf"):
    incalls = "<(bcftools view {})".format(incalls)

However, recent modifications already include bcftools in the final shell command, at line 40:

shell(
    "(bcftools view {incalls} | "
    "snpEff {data_dir} {stats_opt} {csvstats_opt} {extra} "
    "{snakemake.params.reference} "
    "{outprefix} > {outcalls}) {log}"
)

So, with bcf files, the final command line is:

(bcftools view <(bcftools view {incalls}) | snpEff ...

Additional context
Additionally, this behaviour leads to the systematic use of BCFTools piped with SnpEff. This wrapper therefore uses at least two threads while running (one for each tool).

Solution
Please remove lines 23 and 24, since they are no longer needed.
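
The wrapper body would then reduce to something like the following sketch (data_dir, stats_opt, csvstats_opt, extra, outprefix, outcalls and log are assumed to be set earlier in the wrapper, as they are today); the single bcftools view in the pipeline already handles VCF and BCF input alike:

from snakemake.shell import shell

# no special-casing of .bcf needed anymore
incalls = snakemake.input[0]

shell(
    "(bcftools view {incalls} | "
    "snpEff {data_dir} {stats_opt} {csvstats_opt} {extra} "
    "{snakemake.params.reference} "
    "{outprefix} > {outcalls}) {log}"
)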

Many thanks in advance.

Preferred way to submit a meta-wrapper based on PR wrappers?

Hi!
I've been submitting several wrappers for DADA2 recently and planned to submit a meta-wrapper as well. However, I wanted some advice on the preferred way to do so, seeing that the DADA2 wrappers are scattered across different branches. I see two options so far:

Option 1: I can merge all branches, squash the wrapper commits, code the meta-wrapper, and submit the PR. But merging so many branches might get messy.
Option 2: Wait for the wrappers' inclusion and build the meta-wrapper from the new state of the master branch.

I apologize if this message sounds like I want to speed up the reviews; that is not my intent. I just want to know the preferred way to contribute to this awesome project!
Thanks in advance,
Charlie

Displaying the wrapper input and output in the documentation

Is your feature request related to a problem? Please describe.
I noticed that while the meta.yaml files are well filled in, the info within them is not displayed in the documentation (even in v0.67.0), especially the description of a wrapper's input and output.

Describe the solution you'd like
Update the Jinja .rst template to print out this info.

Describe alternatives you've considered
I will do a PR with a suggestion of template.

Best,
Charlie

Improper quoting of log file in fastqc wrapper

Snakemake version
snakemake wrapper version: 0.63.x onwards

Describe the bug
The log filepath gets improperly quoted in the fastqc wrapper, and hence its logs won't be written to the file. Note that a job using this wrapper will still succeed (unless a singularity container is used), but the log file will be empty.

Logs

$snakemake --use-conda -f qc/fastQC/before_trim/rep1_R1_fastqc.html -p

....
....

python /workflow_path/.snakemake/scripts/tmpsz98rhl8.wrapper.py
Activating conda environment:/workflow_path/.snakemake/conda/06b7292e
fastqc  --quiet -t 1 --outdir /scratch/local/tmpenzw80c2/workflow_path/rep1_R1.fastq.gz ' 2> logs/fastQC/rep1_R1.log'
Skipping ' 2> logs/fastQC/rep1_R1.log' which didn't exist, or couldn't be read
mv /scratch/local/tmpenzw80c2/rep1_R1_fastqc.html qc/fastQC/before_trim/rep1_R1_fastqc.html
mv /scratch/local/tmpenzw80c2/rep1_R1_fastqc.zip qc/fastQC/before_trim/rep1_R1_fastqc.zip
[Tue Sep 22 17:14:16 2020]
Finished job 0.
1 of 1 steps (100%) done
Complete log:/workflow_path/.snakemake/log/2020-09-22T171404.568908.snakemake.log

Minimal example
Try wrapper versions 0.62.0 and 0.66.0 to replicate the issue. The former should be problem-free (assuming the other dir/file paths don't contain whitespace characters).

Additional context

"fastqc {snakemake.params} --quiet -t {snakemake.threads} "
"--outdir {tempdir:q} {snakemake.input[0]:q}"
" {log:q}"

In version 0.63.0, the log directive was quoted using snakemake's :q feature, which has the side effect of quoting not just the filepath but also the redirection symbol.
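
One possible fix (a sketch, not necessarily the change that was eventually merged): quote only the log path itself and keep the redirection symbol outside the quotes:

import shlex

from snakemake.shell import shell

# quote the log path, but leave the `2>` redirection unquoted
log = "2> {}".format(shlex.quote(str(snakemake.log))) if snakemake.log else ""

shell(
    "fastqc {snakemake.params} --quiet -t {snakemake.threads} "
    "--outdir {tempdir:q} {snakemake.input[0]:q} {log}"
)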

Container recipe for wrappers

Transferring issue from bitbucket.

Adrien Leger:

Conda env files are a great way to easily deploy software in wrappers, but sometimes the program you want is not on Anaconda Cloud or requires more complicated installation and setup steps.

What about having the possibility to use singularity/docker recipes instead to auto-deploy a wrapper?

Johannes:

Yes, that indeed won’t hurt. But the container images have to come from somewhere where sustainability is guaranteed, e.g. biocontainers. Snakemake wrapper implementation needs a minor extension such that it also searches for singularity/docker image definitions in the wrapper repo. I will put this on my TODO list.

Adrien:

Awesome!
A syntax to use a local file, similar to what is currently available for a conda recipe, would be nice as well, at least for dev?
Thanks

Johannes:

I think it would make sense to simply put a container URL into the meta.yaml.

Problem with the VEP ANNOTATE snakemake wrapper

Snakemake version

Snakemake: 5.23.0

Wrapper

Issue

I'm trying to adapt my regular VEP code to use the snakemake wrapper instead but am running into an issue.

I want to make sure that a) the wrapper works for me and b) it produces the same results as the following:

vep \
        -i {input.sample} \
        --species "caenorhabditis_elegans" \
        --format "vcf" \
        --everything \
        --offline \
        --force_overwrite \
        --fasta {input.ref} \
        --gff {input.annot} \
        --tab \
        --variant_class \
        --regulatory \
        --show_ref_allele \
        --numbers \
        --symbol \
        --protein \
        -o {params.sample}

In order to use VEP with wrappers there are 3 different rules.

I have got the following two working:

# VEP Download Plugins
rule download_vep_plugins:
    output:
        directory("resources/vep/plugins")
    params:
        release=100
    wrapper:
        "0.64.0/bio/vep/plugins"

# VEP Cache
rule get_vep_cache:
    output:
        directory("resources/vep/cache")
    params:
        species="caenorhabditis_elegans",
        build="WBcel235",
        release="100"
    log:
        "logs/vep/cache.log"
    wrapper:
        "0.64.0/bio/vep/cache"

The third rule is to actually run VEP for which I've written the following rule:

rule variant_annotation:
    input:
        calls= lambda wildcards: getVCFs(wildcards.sample),
        cache="resources/vep/cache",
        plugins="resources/vep/plugins",
    output:
        calls=os.path.join(dirs_dict["ANNOT_DIR"],config["ANNOT_TOOL"],"{sample}.annotated.vcf"),
        stats=os.path.join(dirs_dict["ANNOT_DIR"],config["ANNOT_TOOL"],"{sample}.html")
    params:
        plugins=["LoFtool"],
        extra="--everything"
    message: """--- Annotating Variants."""
    resources:
        mem = 30000,
        time = 120
    threads: 4
    wrapper:
        "0.64.0/bio/vep/annotate"

When submitting the job this is the error I receive:

Building DAG of jobs...
Using shell: /cvmfs/soft.computecanada.ca/nix/var/nix/profiles/16.09/bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       variant_annotation
        1

[Tue Aug 25 11:23:04 2020]
Job 0: --- Annotating Variants.

Activating conda environment: /scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f
Failed to open VARIANT_CALLING/varscan/470_sorted_dedupped_snp_varscan.vcf: unknown file type
Possible precedence issue with control flow operator at /scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm line 805.
Traceback (most recent call last):
  File "/scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/scripts/tmpm4v6gdij.wrapper.py", line 44, in <module>
    "(bcftools view {snakemake.input.calls} | "
  File "/home/moldach/bin/snakemake/lib/python3.8/site-packages/snakemake/shell.py", line 156, in __new__
    raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'set -euo pipefail;  (bcftools view VARIANT_CALLING/varscan/470_sorted_dedupped_snp_varscan.vcf | vep --everything --fork 4 --format vcf --vcf --cache --cache_version 100 --species caenorhabditis_elegans --assembly WBcel235 --dir_cache resources/vep/cache --dir_plugins resources/vep/plugins --offline --plugin LoFtool --output_file STDOUT --stats_file ANNOTATION/VEP/470.html | bcftools view -Ov > ANNOTATION/VEP/470.annotated.vcf)' returned non-zero exit status 1.
[Tue Aug 25 11:25:02 2020]
Error in rule variant_annotation:
    jobid: 0
    output: ANNOTATION/VEP/470.annotated.vcf, ANNOTATION/VEP/470.html
    conda-env: /scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f

RuleException:
CalledProcessError in line 393 of /scratch/moldach/MADDOG/VCF-FILES/biostars439754/Snakefile:
Command 'source /home/moldach/miniconda3/bin/activate '/scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f'; set -euo pipefail;  python /scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/scripts/tmpm4v6gdij.wrapper.py' returned non-zero exit status 1.
  File "/scratch/moldach/MADDOG/VCF-FILES/biostars439754/Snakefile", line 393, in __rule_variant_annotation
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.8.0/lib/python3.8/concurrent/futures/thread.py", line 57, in run
Removing output files of failed job variant_annotation since they might be corrupted:
ANNOTATION/VEP/470.annotated.vcf, ANNOTATION/VEP/470.html
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

This issue was originally brought up on StackOverflow, then on the ensembl-vep repo, but it seems it should be posted here for a definitive answer.
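
For anyone hitting the same trace: the failing stage can be reproduced outside Snakemake, which helps separate a wrapper problem from an input problem. The "unknown file type" message is what htslib prints when it cannot parse a file, suggesting the VarScan output may not be a VCF that bcftools can read (a hypothetical check, using the path from the log above):

import subprocess

# reproduce the first pipeline stage from the traceback; if this fails with
# "unknown file type", the input itself is unreadable for bcftools/htslib,
# independent of the VEP wrapper
subprocess.run(
    ["bcftools", "view",
     "VARIANT_CALLING/varscan/470_sorted_dedupped_snp_varscan.vcf"],
    check=True,
    stdout=subprocess.DEVNULL,
)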

Add a joint-wrapper which parallelizes SAMTOOLS MPILEUP/VARSCAN MPILEUP2SNP

There are currently two separate wrappers for SAMTOOLS MPILEUP and VARSCAN MPILEUP2SNP

These tools are run sequentially and are, unfortunately, single-threaded.

I'm in the process of converting shell commands to wrappers so I have not had a chance to benchmark these wrappers specifically; however, I assume it is the same as piping the output of samtools mpileup into varscan, e.g.:

samtools mpileup -f $ref ${array_bam[$j]{"bam"}} |\
    java -jar /home/${user}/projects/def-mtarailo/common/tools/VarScan.v2.3.9.jar pileup2snp \
    --variants \
    --min-coverage $mincov \
    --min-avg-qual $minqual \
    --min-var-freq $minfreq > ${path}/calling/varscan/${array_bam[$j]{"id"}}_snp_varscan.vcf ;

When I initially benchmarked the above code on C. elegans it took 101 minutes.

It would be ideal to create a joint-wrapper, combining the two tools, taking advantage of samtools mpileup's --region parameter and GNU Parallel.

The shell command I'm currently using is:

rule variant_calling:
    input:
        ref = os.path.join(dirs_dict["REF_DIR"],config["REF_GENOME"]),
        bam = lambda wildcards: getDeduppedBams(wildcards.sample),
        bam_index = lambda wildcards: getDeduppedBamsIndex(wildcards.sample)
    output:
        os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_sorted_dedupped_snp_varscan.vcf")
    log: os.path.join(dirs_dict["LOG_DIR"],config["CALLING_TOOL"],"{sample}.log")
    resources:
        mem = 10000,
        time = 90
    threads: 7
    params:
        sample = "{sample}"
    message: """--- Varscan pileup2snp."""
    shell: """
        module load samtools;
        module load java/13.0.1;
        echo -n "I II III IV V X MtDNA" |\
        xargs -d " " -n 1 -P 7 -I {{}} /bin/bash -c \
        "samtools mpileup -r {{}} \
        -f ~/projects/def-mtarailo/common/indexes/WS265_wormbase/{{}}.fa \
        {input.bam} |\
        java -Xmx5G -jar ~/projects/def-mtarailo/common/tools/VarScan.v2.3.9.jar pileup2snp \
        --variants \
        --min-coverage 5 \
        --min-avg-qual 30 \
        --min-var-freq 0.9 > {params.sample}_{{}}.vcf"
        awk 'FNR==1 && NR!=1 {{ while (/^<header>/) getline; }} 1 {{print}} ' *.vcf > {output}

        rm {params.sample}_I.vcf {params.sample}_II.vcf {params.sample}_III.vcf {params.sample}_IV.vcf {params.sample}_V.vcf {params.sample}_X.vcf {params.sample}_MtDNA.vcf
        """

This parallelized the variant-calling process by applying these operations to each chromosome (on a separate core), reducing computation time to 17 minutes: an approximately 83% decrease in processing time for the lengthiest step in the C. elegans pipeline.
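
In Snakemake itself, the same scatter/gather can be expressed without xargs by making the chromosome a wildcard; a joint wrapper could follow the same pattern. A rough sketch (chromosome names, paths, and tool flags are taken from the shell command above; the bioconda varscan executable and a merge that keeps only the first file's header are assumptions):

CHROMS = ["I", "II", "III", "IV", "V", "X", "MtDNA"]

rule call_chrom:
    input:
        ref="indexes/WS265_wormbase/{chrom}.fa",
        bam="dedup/{sample}.bam",
    output:
        temp("calls/{sample}.{chrom}.vcf"),
    shell:
        "samtools mpileup -r {wildcards.chrom} -f {input.ref} {input.bam} | "
        "varscan pileup2snp --variants --min-coverage 5 "
        "--min-avg-qual 30 --min-var-freq 0.9 > {output}"

rule merge_chroms:
    input:
        expand("calls/{{sample}}.{chrom}.vcf", chrom=CHROMS),
    output:
        "calls/{sample}_snp_varscan.vcf",
    shell:
        "awk 'FNR==1 && NR!=1 {{next}} {{print}}' {input} > {output}"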

GitHub Actions failure `NoSpaceLeftError: No space left on devices.` with unicycler conda environment creation

Snakemake version
snakemake version: 5.26.1
wrapper version: not yet released, current version on master branch

Describe the bug
The tests on the pull request for unicycler's addition ran fine (and quickly, in a few minutes):
https://github.com/snakemake/snakemake-wrappers/runs/1342977411#step:7:265

Since it has been merged into master, tests on master are consistently failing with the following error (after running for hours):

Creating conda environment /tmp/tmprc8d9w2k/master/bio/unicycler/environment.yaml...
Downloading and installing remote packages.
CreateCondaEnvironmentException:
Could not create conda environment from /tmp/tmprc8d9w2k/master/bio/unicycler/environment.yaml:
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done
Preparing transaction: ...working... done
Verifying transaction: ...working... failed

NoSpaceLeftError: No space left on devices.

See here:
https://github.com/snakemake/snakemake-wrappers/runs/1343943714#step:7:321


Minimal example
See GitHub Actions results of recent merge commits.

Strelka2 bug: unsupported pickle protocol: 3

Snakemake version
5.24.2

Describe the bug

I'm getting an error when using the Strelka2 wrapper with the --use-singularity and --use-conda flags in conjunction, but not when --use-conda is used alone.

I'm wondering if this is a problem with the wrapper or with the singularity image. If it's the image, can you please suggest a better one to use? (The traceback below suggests the wrapper script, which Snakemake serializes with a Python 3 pickle, is being unpickled by the Python 2.7 interpreter from the conda environment.)

Logs

Workflow defines that rule get_vep_cache is eligible for caching between workflows (use the --cache argument to enable this).
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 8
Rules claiming more threads will be scaled down.
Job counts:
	count   jobs
        1	strelka2
        1

[Thu Sep 24 17:16:03 2020]
Job 0: --- Call germline variants with Strelka2.

python /home/moldach/wrappers/SUBSET/.snakemake/scripts/tmpoc6_0bnr.wrapper.py
Activating singularity image /home/moldach/wrappers/SUBSET/.snakemake/singularity/d7617773b315c3abcb29e0484085ed06.simg
Activating conda environment: /home/moldach/wrappers/SUBSET/.snakemake/conda/1995398f
Traceback (most recent call last):
  File "/home/moldach/wrappers/SUBSET/.snakemake/scripts/tmpoc6_0bnr.wrapper.py", line 3, in <module>
    x00sampleqhK\x00N\x86qish\x15]qj(h\x17h\x18eh\x17h\x19h\x1e\x85qkRql(h\x1e)}qmh"h\x17sNtqnbh\x18h\x19h\x1e\x85qoRqp(h\x1e)}qqh"h\x18sNtqrbX\x06\x00\x00\x00sampleqsheubX\x07\x00\x00\x00threadsqtK\x08$
  File "/home/moldach/wrappers/SUBSET/.snakemake/conda/1995398f/lib/python2.7/pickle.py", line 1388, in loads
    return Unpickler(file).load()
  File "/home/moldach/wrappers/SUBSET/.snakemake/conda/1995398f/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/home/moldach/wrappers/SUBSET/.snakemake/conda/1995398f/lib/python2.7/pickle.py", line 892, in load_proto
    raise ValueError, "unsupported pickle protocol: %d" % proto
ValueError: unsupported pickle protocol: 3
[Thu Sep 24 17:16:18 2020]
Error in rule strelka2:
    jobid: 0
    output: strelka/MTG324
    log: logs/bowtie2/MTG324.log (check log file(s) for error message)
    conda-env: /home/moldach/wrappers/SUBSET/.snakemake/conda/1995398f

RuleException:
CalledProcessError in line 453 of /home/moldach/wrappers/SUBSET/Snakefile:
Command ' singularity exec --home /home/moldach/wrappers/SUBSET  --bind /home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages:/mnt/snakemake /home/moldach/wrappers/SUBSET/.snakemake/singula$
  File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 2189, in run_wrapper
  File "/home/moldach/wrappers/SUBSET/Snakefile", line 453, in __rule_strelka2
  File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 529, in _callback
  File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/concurrent/futures/thread.py", line 57, in run
  File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 515, in cached_or_run
  File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 2201, in run_wrapper
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

Minimal example

Singularity container

singularity: "docker://continuumio/miniconda3:4.5.11"

Rule

def getDeduppedBamsIndex(sample):
  return(list(os.path.join(aligns_dict[sample],"{0}.sorted.dedupped.bam.bai".format(sample,pair)) for pair in ['']))

if (config["CALLING_TOOL"]=="strelka2"):
        rule strelka2:
            input:
                    fasta=os.path.join(dirs_dict["REF_DIR"],config["REF_GENOME"]),
                    bam=lambda wildcards: getDeduppedBams(wildcards.sample),
                    bam_index=lambda wildcards: getDeduppedBamsIndex(wildcards.sample),
                    fasta_index=os.path.join(dirs_dict["REF_DIR"],GENOME_INDEX)
            output:
                    temp(directory("strelka/{sample}"))
            log: os.path.join(dirs_dict["LOG_DIR"],config["ALIGN_TOOL"],"{sample}.log")
            message: """--- Call germline variants with Strelka2."""
            threads: 8
            resources:
                    mem=4000,
                    time=100
            params:
                    config_extra="",
                    run_extra=""
            wrapper:
                    "0.65.0/bio/strelka/germline"
            """


bwa and bwa-mem2 index could infer prefix from output

Is your feature request related to a problem? Please describe.
At least in bwa and bwa-mem2 index, the destination (prefix) could be directly inferred from the output, making it unnecessary to specify the prefix param.

Describe the solution you'd like
The wrapper would infer the prefix from snakemake.output[0]

@Smeds do you think it makes sense?
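
Concretely, the inference could be a one-liner inside the wrapper (a sketch, written for the wrapper context where the snakemake object is injected; it assumes all generated index files share the prefix of the first output):

import os

# bwa index writes {prefix}.amb/.ann/.bwt/.pac/.sa, so the prefix is simply
# the first output file with its final extension removed
prefix = os.path.splitext(snakemake.output[0])[0]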

Parental leave until end of February, please be patient :-)

Hi folks, I am on parental leave until end of February.
Hence, I won't have the chance to look into your bug reports and PRs until then.

Of course, you are all invited to do mutual reviews :-)!

Thanks a lot for your patience.
Johannes

Not possible to download latest versions of GRCh37

Snakemake version
snakemake: 5.9.1
snakemake-wrappers: 0.45.1

Describe the bug
It's not possible to download the latest versions of the GRCh37 build (releases from 76 onward). These live at a different location than the main releases:
ftp://ftp.ensembl.org/pub/grch37/release-98/fasta/homo_sapiens/dna/ and ftp://ftp.ensembl.org/pub/release-98/fasta/homo_sapiens/dna/ respectively.

Logs
ValueError: Requested sequence does not seem to exist on ensembl FTP servers or servers are unavailable (url ftp://ftp.ensembl.org/pub/release-98/fasta/homo_sapiens/dna/Homo_sapiens.grch37.dna.toplevel.fa.gz)
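
A sketch of the URL branching that would cover both layouts (variable names and example values are hypothetical; the extra grch37/ path component applies to GRCh37 releases from 76 onward):

# hypothetical example values
build = "GRCh37"
release = "98"
species = "homo_sapiens"

branch = "grch37/" if build == "GRCh37" and int(release) >= 76 else ""

url = (
    "ftp://ftp.ensembl.org/pub/{branch}release-{release}/fasta/{species}/dna/"
    "{species_cap}.{build}.dna.toplevel.fa.gz"
).format(
    branch=branch,
    release=release,
    species=species,
    species_cap=species.capitalize(),
    build=build,
)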
