lauringlab / variant_pipeline

Work on the variant_pipeline and the initial R analysis used in calling variants from NGS data

License: Apache License 2.0


variant_pipeline's Introduction

Lauring Variant Pipeline

Nominates candidate variants by comparing the sequences in a test sample to those found in a plasmid control. The pipeline runs as one phase that takes in fastq files and outputs putative variants as well as all base calls above a set frequency. It is then up to the user to filter the putative variants based on the characteristics provided.

Directory list

  • bin

    • variantPipeline.py : a python wrapper that runs a provided bpipe pipeline
    • variant_pipeline.pbs : an example PBS script used to run the pipeline
  • doc

    • workflow diagram, examples
  • lib

    • supporting libraries - picard and bpipe live here
  • packrat

    • The R dependencies are listed in the lock file; the packages will also be downloaded here during setup
  • scripts

    • supporting scripts (bpipe, python, R); some are old and no longer used, others are helpful for downstream analysis but are not used by the current stages
    • most have been written to provide a usage message when called with the -h flag
  • test

    • automated tests (mostly python testing the python pipeline)
  • tutorial

    • The directories and instructions needed to run the tutorial. Instructions can be found in the tutorial readme

bin/variantPipeline.py

This script is a thin python wrapper that takes in a bpipe pipeline, input files, an output directory, and an options yaml. Whenever it is launched, the bpipe scripts are copied from the scripts directory and stored in the output directory as a log of what was run. The output directory will be created if it doesn't exist.

Usage: python variantPipeline.py -h

See the tutorial for more information.

*NOTE: Your fasta is used in the variant calling step and needs to end in .fa*
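
To make the inputs concrete, an invocation might look like the sketch below. The flag names and paths are illustrative assumptions, not the script's confirmed interface; run python variantPipeline.py -h for the real options.

# hypothetical flags and paths - check variantPipeline.py -h for the actual interface
python bin/variantPipeline.py \
  -p scripts/aligning_pipeline.groovy \
  -i data/fastq/ \
  -o results/ \
  -y options.yaml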

Outputs

There are three main pipelines that can be run. All of the stages for the pipelines are held in ./scripts/variantPipeline.bpipe.stages.groovy.

Basic aligning: scripts/aligning_pipeline.groovy

  • cutadapt
    • the trimmed fastq files - these are trimmed based on NEBNext primers, which are hard-coded in the stage
  • fastqc
    • fastqc data on samples
  • align
    • The aligned bam and sorted sam files
  • removed_duplicated
    • bam files with duplicate reads removed
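
The wrapper is the intended entry point, but the stages are ordinary bpipe pipelines, so they can also be launched with the bundled bpipe directly. A minimal sketch, assuming the bundled launcher lives at lib/bpipe/bin/bpipe:

lib/bpipe/bin/bpipe run scripts/aligning_pipeline.groovy data/fastq/*.fastq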

DeepSNV pipeline: scripts/deepsnv_pipeline.groovy

Running this pipeline after the one above is equivalent to the old single pipeline.

  • deepSNV
    • csv summary files, coverage files and fasta files from deepSNV
  • parsed_fa
    • deepSNV outputs a concatenated fasta file. The parsed ones are here.
  • Variants
    • csv files containing all variants and additional quality data about each one (MapQ, Phred, read position, etc.)
  • Filter Variants
    • csv files containing variants that meet quality thresholds
  • Final Variants
    • csv files containing variants that meet quality thresholds including amino acid information
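
As noted above, running this pipeline after the aligning one reproduces the old single pipeline. A hedged sketch of that chaining, assuming the deduplicated bams land in a removed_duplicated/ directory as listed earlier:

lib/bpipe/bin/bpipe run scripts/deepsnv_pipeline.groovy results/removed_duplicated/*.bam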

Python pipeline to call all variants and sequencing errors: scripts/python_pipeline.groovy

  • consensus
    • The consensus sequence of each sample
  • position-stats
    • JSON files with all bases called at every position, including amino acid designations
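
To eyeball one of the position-stats files, the pretty-printer in the python standard library is enough; the sample file name here is a placeholder:

python -m json.tool results/position-stats/sample_1.json | head -n 40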

Dependencies

Note: Flux is the name of the computing cluster used by our lab at the University of Michigan. Some of the directions may be specific to those working on this platform.

The pipeline comes with many of the required programs (bpipe and picard); however, bowtie2, samtools, and certain R and python libraries are required for variant calling.

To run all of these pipelines you must have the Java Development Kit installed. It can be installed from here. If bpipe doesn't run, this is the first place to check.

All the other dependencies, except R and the R packages, are handled by conda. Install conda by following the tutorial here.

We can install the conda environment with the following command (run from the variant_pipeline/ directory):

conda env create -f scripts/environment.yml

We have to activate the environment before running the commands below:

conda activate variant-pipeline

On Flux we can achieve an equivalent environment by loading the following modules:

module load muscle
module load bowtie2
module load python-anaconda2/201704
module load fastqc
module load R

The R packages are managed by packrat. I am using R 3.5.0. From the main directory run

R
packrat::restore()

to download the needed dependencies. They will be placed in the packrat/lib directory. This is important since the R scripts look for them there. You may need to install packrat first if you don't have it; a sketch of that follows.
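
As a sketch, installing packrat and restoring in one shot from the command line (run from the main directory so packrat finds the project lock file):

Rscript -e 'install.packages("packrat"); packrat::restore()'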

Adapted and developed by JT McCrone based on work done by Chris Gates and Peter Ulintz of the UM BCRCF Bioinformatics Core.

variant_pipeline's People

Contributors

alauring, andrewvalesano, debbink


variant_pipeline's Issues

failed to install some R packages with packrat

First of all, I am installing the pipeline on a cluster into my local space, as I am a regular user rather than someone with root privileges.
Some R packages like Rhtslib are failing to install, and the issue seems to be that packrat cannot direct the installation to happen under my local path.
This problem has been known and discussed in places like this one:
rstudio/packrat#418

Wouldn't it make more sense to use BiocManager to install the long list of R libraries that this pipeline requires?

adding results files to the tutorial for comparison

Hi,
Would it be possible for you to add results files that correspond to the test data set you made available here?

I finished a successful run of this pipeline and I would like to compare my results with yours.

Thanks much in advance.

--Gloria

deepSNV.r generates output.csv files with different column order - why?

I ran deepSNV.r three times, changing only one parameter each time.
The order of the columns in the three output.csv files is different.

Rscript --vanilla --slave deepSNV.R wsn33_wt_plasmid.fa \
  HA-22_S7_L001.sorted.dedup.filtered.bam \
  Plasmid-control_S49_L001.sorted.dedup.filtered.bam \
  bonferroni 0.1 fisher one.sided 1.1 \
  output.csv output.fa /.../src/R_lib

Rscript --vanilla --slave deepSNV.R wsn33_wt_plasmid.fa \
  HA-22_S7_L001.sorted.dedup.filtered.bam \
  Plasmid-control_S49_L001.sorted.dedup.filtered.bam \
  bonferroni 0.1 fisher one.sided 0.1 \
  output3.csv output3.fa /.../src/R_lib

Rscript --vanilla --slave deepSNV.R wsn33_wt_plasmid.fa \
  HA-22_S7_L001.sorted.dedup.filtered.bam \
  Plasmid-control_S49_L001.sorted.dedup.filtered.bam \
  bonferroni 0.1 fisher one.sided 0.5 \
  output4.csv output4.fa /.../src/R_lib
The corresponding output files are these:

output.csv.txt
output3.csv.txt
output4.csv.txt

reciprocal_variants.py throws error

Hi,
I had a batch of 15 samples to analyze and was able to obtain results for the majority of them.
However, four samples caused the reciprocal_variants.py script to fail with the error below.

Traceback (most recent call last):
  File "src/variant_pipeline/scripts/reciprocal_variants.py", line 130, in <module>
    var = get_reference(row,bam,"var")
  File "src/variant_pipeline/scripts/reciprocal_variants.py", line 94, in get_reference
    for pileupcolumn in bam.pileup(df.chr,py_pos,py_pos+1,truncate=True,stepper="all",max_depth=1E6): # look at the range py_pos to py_pos+1 and all the bases that align to those positions. Think of a column on each position like the samtools output of mpileup
  File "pysam/libcalignmentfile.pyx", line 1051, in pysam.libcalignmentfile.AlignmentFile.pileup (pysam/libcalignmentfile.c:12611)
  File "pysam/libcalignmentfile.pyx", line 774, in pysam.libcalignmentfile.AlignmentFile.parse_region (pysam/libcalignmentfile.c:10572)
  File "pysam/libcalignmentfile.pyx", line 1631, in pysam.libcalignmentfile.AlignmentFile.gettid (pysam/libcalignmentfile.c:18839)
  File "pysam/libcalignmentfile.pyx", line 689, in pysam.libcalignmentfile.AlignmentFile.get_tid (pysam/libcalignmentfile.c:9515)
  File "pysam/libcutils.pyx", line 122, in pysam.libcutils.force_bytes (pysam/libcutils.c:3081)
TypeError: Argument must be string, bytes or unicode.

Have you seen this error before? What is the cause? How can it be corrected?

control.bam

Dear sir,
We used the Illumina platform to determine the whole-genome consensus sequence and to identify intrahost single-nucleotide variants for each patient-derived sample. We want to use deepSNV to identify single-nucleotide variants (SNVs), which relies on a clonal control to estimate the local error rate within a given sequence context and to identify strand bias in base calling. The clonal control was a library prepared in an identical fashion from 8 plasmids containing the genome of the respective circulating reference strain and sequenced on the same flow cell to control for batch effects.
But we don't know anything about the 'control'.
Can you help us figure out what the 'control' is and how to use it?

Can't reproduce results of tutorial, too many variants with p.val=NA

Hi,
First of all, I could not get the pipeline to run "as is", so I ran each step manually (see the script log at the end of this post).

My plots look similar to those of the tutorial.
However, when I got to the final part, where the filtering analysis is demonstrated, I noticed that the metrics of my variants do not match those of the tutorial. In particular, my p.val values are all NA for variants with Read_pos between 50 and 200.

I am puzzled by these discrepancies. Perhaps you @jtmccr1 can help me here.
Your input would be greatly appreciated. Thanks

===============================================================

tool provenance

module load MUSCLE/3.8.31-IGB-gcc-4.9.4
module load Bowtie2/2.3.2-IGB-gcc-4.9.4
module load SAMtools/1.7-IGB-gcc-4.9.4
module load Python/3.6.1-IGB-gcc-4.9.4
module load FastQC/0.11.5-IGB-gcc-4.9.4-Java-1.8.0_152
module load picard/2.10.1-Java-1.8.0_152
module load R/3.5.0-IGB-gcc-4.9.4

#alignment

bowtie2 --seed 42 --sensitive \
  -x ../../data/reference/wsn33_wt_plasmid \
  -1 ../../data/fastq/HA-22_S7_L001.1.1.fastq \
  -2 ../../data/fastq/HA-22_S7_L001.2.1.fastq \
  -S HA-22_S7_L001.sam 2> HA-22_S7_L001.align.log

bowtie2 --seed 42 --sensitive \
  -x ../../data/reference/wsn33_wt_plasmid \
  -1 ../../data/fastq/Plasmid-control_S49_L001.1.1.fastq \
  -2 ../../data/fastq/Plasmid-control_S49_L001.2.1.fastq \
  -S Plasmid-control_S49_L001.sam 2> Plasmid-control_S49_L001.align.log

filtering

samtools view -Shq 30 HA-22_S7_L001.sam > HA-22_S7_L001_filtered.sam
samtools view -Shq 30 Plasmid-control_S49_L001.sam > Plasmid-control_S49_L001_filtered.sam

mark and remove PCR duplicates

java -Xmx4g -Djava.io.tmpdir=./ -jar ${EBROOTPICARD}/picard.jar MarkDuplicates \
  INPUT=HA-22_S7_L001_filtered_sorted.bam \
  OUTPUT=HA-22_S7_L001_filtered_sorted_nodup.bam \
  REMOVE_DUPLICATES=true \
  CREATE_INDEX=true \
  METRICS_FILE=HA-22_S7_L001-picard.out.metrics \
  VALIDATION_STRINGENCY=LENIENT

java -Xmx4g -Djava.io.tmpdir=./ -jar ${EBROOTPICARD}/picard.jar MarkDuplicates \
  INPUT=Plasmid-control_S49_L001_filtered_sorted.bam \
  OUTPUT=Plasmid-control_S49_L001_filtered_sorted_nodup.bam \
  REMOVE_DUPLICATES=true \
  CREATE_INDEX=true \
  METRICS_FILE=Plasmid-control_S49_L001-picard.out.metrics \
  VALIDATION_STRINGENCY=LENIENT

call variants with deepSNV.R

This R script was not able to convert reference.fasta to regions.bed.

Error message:

could not find function "fasta.info"

So, I performed this step offline (see the sketch after the settings below) and then tweaked the script to read this input file instead:

reference.bed <- "reference.bed"
test.bam <- "HA-22_S7.bam"
control.bam <- "Plasmid-control_S49.bam"
method <- "bonferroni"
p.cut <- "0.1"
p.com.meth <- "fisher"
disp <- "two.sided"
stringent_freq <- "0.5"
csv <- "results.csv"
fa <- "results.fa"
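
For what it's worth, fasta.info appears to have been removed from recent Biostrings releases. An offline equivalent of the fasta-to-bed step might look like the following sketch, with file names taken from this issue and BED-style chrom/start/end columns assumed:

Rscript -e 'lens <- Biostrings::fasta.seqlengths("reference.fasta"); write.table(data.frame(names(lens), 0, lens), "reference.bed", sep="\t", quote=FALSE, row.names=FALSE, col.names=FALSE)'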

add metrics with mapq.py

This python script was not parsing the file correctly

Error message:

Traceback (most recent call last):
  File "../../scripts/mapq.py", line 52, in <module>
    pos=int(line[1]) #The position relative to the chromosome
ValueError: invalid literal for int() with base 10: 'HA'

I reran the deepSNV.R step and noticed that when I SKIPPED the step that adjusts the model, the output file DOES have the format that mapq.py expects. So, I skipped it.

However, the results of my final output file do NOT match those of the tutorial: I see TOO many variants with p.val=NA.

update tutorial

Update the tutorial to incorporate the changes in the HPC workflows.
