boevalab / oncocnv

ONCOCNV - a package to detect copy number changes in Targeted Deep Sequencing and Exome-seq data

Shell 5.22% Perl 26.90% R 67.87%

oncocnv's Introduction

ONCOCNV

ONCOCNV - a package to detect copy number changes in Deep Sequencing data

REQUIREMENTS

Perl and R installed and added to the PATH
E.g., export PATH=$PATH:YOURPATH/R/bin
SAMtools (http://samtools.sourceforge.net/) installed and added to the PATH
To add to PATH, type in the command line or add to "ONCOCNV.sh":
export PATH=$PATH:YOURPATH/samtools/bin
or
alias samtools=YOURPATH/samtools/bin/samtools
BEDTools (http://bedtools.readthedocs.org/en/latest/) installed and added to the PATH
To add to PATH, type in the command line or add to "ONCOCNV.sh":
export PATH=$PATH:YOURPATH/BEDTools/bin/
or
alias bedtools=YOURPATH/BEDTools/bin/bedtools
The following R libraries should be installed: MASS, mclust, PSCBS, DNAcopy, R.cache, scales, cwhmisc, fastICA, cghseg, digest
The fasta sequence (one file, unzipped; e.g. "hg19.fa") of the targeted genome should be downloaded from http://hgdownload.soe.ucsc.edu/downloads.html
You need to have your data aligned (.bam files)
You need at least three control samples to construct a reliable baseline. ONCOCNV will run with only 2 controls starting from version 5.4 and with just one control starting from version 5.7, but we recommend at least 3 controls for good performance of the algorithm.
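
Before editing "ONCOCNV.sh", it can help to confirm that the command-line requirements above are visible on the PATH. The following is a minimal sketch; the function name check_tools is hypothetical and not part of ONCOCNV, and it only tests that each command can be found, not its version.

```shell
#!/usr/bin/env bash
# check_tools TOOL...: report which of the given commands are missing from PATH.
# Returns 0 when every tool is found, 1 otherwise.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "MISSING: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For the requirements listed here, a call would look like: check_tools perl R samtools bedtools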

INSTALLATION

Download ONCOCNV.zip (or ONCOCNV.vX.X.zip)
Unzip files into directory "scripts"

Check requirements (R and the necessary R packages must be installed). To install the necessary R packages (once R is installed), type in the command line:

R
install.packages("MASS")
install.packages("mclust")
install.packages("R.cache")
install.packages("scales")
install.packages("cwhmisc")
install.packages("fastICA")
install.packages("cghseg")
install.packages("digest")
# Bioconductor's biocLite.R installer has been retired; on current R versions use BiocManager:
install.packages("BiocManager")
BiocManager::install("DNAcopy")
install.packages("PSCBS")
quit()

RUN ONCOCNV

Open "ONCOCNV.sh" with a text editor (gedit, textpad, etc.)
Set correct paths and filenames in the top part of the "ONCOCNV.sh"
Check properties of "ONCOCNV.sh"
chmod +rwx PathToONCOCNV/scripts/ONCOCNV.sh
Check formats:
o reads should be given in .BAM format
o amplicon coordinates should be given in .bed format (with or without a header line) and have the amplicon ID in column 4 and the gene symbol in column 6, e.g.:
chr1 2488068 2488201 AMPL223847 0 TNFRSF14
It is mandatory to provide gene names in the 6th column.

VERY IMPORTANT

	Please make sure that:
-	There are no duplicates in the coordinates
-	Coordinates are sorted
-	Gene names are gene names in the sense that corresponding amplicons fall in the same genomic locus and not on different chromosomes
-	Gene names cannot be the same as amplicon names or IDs because ONCOCNV assumes several amplicons per gene
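
The checks listed above can be automated before running ONCOCNV. The sketch below is an illustration under stated assumptions, not part of the package: validate_bed is a hypothetical helper, fields are split on whitespace, lines starting with "track", "browser" or "#" are treated as header lines, and "sorted" is taken to mean that start coordinates do not decrease on consecutive lines of the same chromosome.

```shell
#!/usr/bin/env bash
# validate_bed FILE: print problems found in a 6-column amplicon BED file.
# Returns 0 if no problem was found, 1 otherwise.
validate_bed() {
    awk '
        /^(track|browser|#)/ { next }   # skip optional header or comment lines
        NF < 6 { print "line " NR ": fewer than 6 columns"; bad = 1; next }
        {
            key = $1 ":" $2 "-" $3
            if (key in seen) { print "line " NR ": duplicate coordinates " key; bad = 1 }
            seen[key] = 1
            if ($1 == prev_chr && $2 + 0 < prev_start) {
                print "line " NR ": not sorted by start within " $1; bad = 1
            }
            if ($4 == $6) { print "line " NR ": gene name equals amplicon ID (" $4 ")"; bad = 1 }
            if (($6 in gene_chr) && gene_chr[$6] != $1) {
                print "line " NR ": gene " $6 " is on both " gene_chr[$6] " and " $1; bad = 1
            }
            gene_chr[$6] = $1
            prev_chr = $1; prev_start = $2 + 0
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

Usage: validate_bed design.bed && echo "BED file looks consistent" (design.bed is a placeholder file name). A clean result only means these four checks passed; it does not guarantee that ONCOCNV will accept the file.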

Run "ONCOCNV.sh" from the command line:
cd PathToONCOCNV/scripts
./ONCOCNV.sh
or
. PathToONCOCNV/scripts/ONCOCNV.sh

HOW TO READ OUTPUT FILES

There are three output files per sample:

*.profile.png
- Visual representation of normalized and annotated copy number profile
  Each dot corresponds to an amplicon; the X-axis is not to scale.
  Color code:
  o GREEN one-point-outlier
  o DARK GREY SURROUNDINGS frequent one-point-outlier
  o BROWN >1 level gain
  o BROWN SURROUNDINGS 1-level gain
  o BLUE >1 level loss
  o BLUE SURROUNDINGS 1-level loss
*.summary.txt
- predictions per gene

gene gene name
chr chromosome name
start first amplicon start
end last amplicon start
copy.number predicted copy number (neither normal contamination nor subclones are taken into account)
p.value p-value for the copy number status of the genomic region encompassing the gene
q.values q-value ("fdr"-corrected p-value) for the copy number status of the genomic region encompassing the gene
comments p-value for the hypothesis that the copy number of the gene does not match the copy number of the encompassing segment
(in the case of a break within a gene - it is the p-value for the break)
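
As an illustration of how the per-gene table can be used, the sketch below lists genes with a non-neutral predicted copy number and a small q-value. It rests on assumptions not confirmed here: that *.summary.txt is tab-delimited with an unquoted header line using the column names above (gene, copy.number, q.values), and that 2 is the neutral copy number (which does not hold for chrX in male samples). The function name significant_genes and the 0.05 default cutoff are hypothetical choices.

```shell
#!/usr/bin/env bash
# significant_genes FILE [QMAX]: print gene, copy number and q-value for genes
# whose predicted copy number differs from 2 and whose q-value is below QMAX
# (default 0.05). Columns are located by header name, not by position.
significant_genes() {
    awk -F'\t' -v qmax="${2:-0.05}" '
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("gene" in col) || !("copy.number" in col) || !("q.values" in col)) {
                print "expected header columns not found" > "/dev/stderr"
                exit 2
            }
            next
        }
        {
            cn = $(col["copy.number"]); q = $(col["q.values"])
            if (cn != "NA" && q != "NA" && q + 0 < qmax + 0 && cn + 0 != 2)
                print $(col["gene"]) "\t" cn "\t" q
        }
    ' "$1"
}
```

Usage: significant_genes sample.summary.txt 0.01 (sample.summary.txt is a placeholder file name).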

*.profile.txt
- predictions per amplicon (detailed information)

chr chromosome name
start amplicon start
end amplicon end
gene gene name
ID amplicon ID
ratio logarithm of the normalized read count (zero values correspond to the neutral copy number)
predLargeSeg copy number predicted by segmentation of normalized read counts
predLargeCorrected final prediction for the copy number
pvalRatioCorrected p-value of the t-test to test the difference between the normalized read counts and the value expected from the segmentation or from the gene-based copy number assessment
perGeneEvaluation copy number predicted per gene (unaware of the segmentation)
pvalRatioGene gene-based p-value of the t-test for the difference of the mean of the normalized read counts from zero
predPoint predicted one-point-outlier
predPointSusp predicted (frequent) one-point-outlier
comments additional information:
SegRatio mean value of the logarithms of the normalized read counts per segment
AbsMeanSigma normalized difference of the mean value (~z-score/sqrt(#amplicons in the segment))
pvalue p-value for AbsMeanSigma
pvalueTTest p-value of the t-test (per segment)

oncocnv's Issues

Various issues: "span is too small"

Error in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
span is too small
https://drive.google.com/open?id=0B4x0rK6clrNnWEZES0oxQmFwY3M

Error in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
'x' is singular: singular fits are not implemented in 'rlm'
https://drive.google.com/open?id=0B4x0rK6clrNnX0pCRWxhMVVNU1U

Raw data unavailable

Hi,

I was looking for the raw data to use as a validation dataset, but the page is not available anymore:
https://github.com/BoevaLab/ONCOCNV/blob/master/testdata/Dataset_A_results_3tools.zip

Is there a way to retrieve raw fastq data for validation purposes?

Thanks a lot and have a nice day!

Valentina

Error in mvnX when running processSamples.R of OncoCNV

Hi,

I tried to run ONCOCNV.sh (most recent version) on a set of 9 normal sample alignments to generate a baseline to call CNVs on a set of 8 tumor samples. I used an appropriate tab-delimited BED file (chr, start, end, amplicon ID, 0, and gene name), BAM alignments, and an Ensembl reference genome. All tool requirements were satisfied. However, I was getting the following error with processSamples.R:

running processSamples.R
Package 'mclust' version 5.4.5
Type 'citation("mclust")' for citing this R package in publications.
PSCBS v0.65.0 successfully loaded. See ?PSCBS for help.

Attaching package: ‘PSCBS’

The following objects are masked from ‘package:base’:

append, load

R.cache v0.13.0 (2018-01-03) successfully loaded. See ?R.cache for help.
Loading required package: lattice
Loading required package: grid
Loading required package: parallel
Error in mvnX(data = data, prior = prior) :
NA/NaN/Inf in foreign function call (arg 1)
Calls: predictClust ... eval -> eval -> mclustBIC -> mvn -> eval -> eval -> mvnX
In addition: Warning messages:
1: In chrom != "chrM" & chrom != "chrX" & chrom != "chrY" & chrom != :
longer object length is not a multiple of shorter object length
2: In chrom != "chrM" & chrom != "chrX" & chrom != "chrY" & chrom != :
longer object length is not a multiple of shorter object length
3: In chrom != "chrM" & chrom != "chrX" & chrom != "chrY" & chrom != :
longer object length is not a multiple of shorter object length
4: In rnorm(length(autoInd), 0, sdError[autoInd]) : NAs produced
Execution halted

I want to mention that the Control.stats.Processed.txt file from processControls.R contained some fields with missing values (NA) for mean and IC1-3. Replacing those values with arbitrary numbers did not seem to help. Any clues as to what could be causing this error would be much appreciated.

Thank you!

no output from ONCOCNV_getCounts.pl

Hi,

Since the shell script is not working, I started running the commands individually. I noticed that there is no output file from ONCOCNV_getCounts.pl, and I couldn't figure out what the error is. Could you advise me whether this is due to the overlapping regions in the bed file? See below for the complete output on the terminal.

Thanks
Ram
perl ~/biotools/ONCOCNV-6.9/src/ONCOCNV_getCounts.pl getControlStats -m Ampli -b bedfiles/ecmp_iWG_IAD166152_1.20181012.designed.bed -c bamfiles/N103_C01_NA12787cellline_S5.bam, bamfiles/N023_A6_LRS_1_092711_S4.bam -o Control.stats.txt
number of targeted regions: 6521
These two regions overlap by more than 75%: chr7 140481362 - 140481481 and 140481350 - 140481469
merged into BRAF_8.19268_1.212400.BRAF_8.19268_1.7277: chr7:140481350-140481481
These two regions overlap by more than 75%: chr7 128851476 - 128851602 and 128851509 - 128851617
merged into CHP2_SMO_5_1.91226.CHP2_SMO_5_1.92038: chr7:128851476-128851617
These two regions overlap by more than 75%: chr22 29091675 - 29091804 and 29091658 - 29091786
merged into SP_130.44364_1.226729.SP_130.44364_1.82460: chr22:29091658-29091804
These two regions overlap by more than 75%: chr19 41737069 - 41737200 and 41737086 - 41737215
merged into AXL_6.17481_1.352.AXL_6.17481_1.36514: chr19:41737069-41737215
These two regions overlap by more than 75%: chr19 11172374 - 11172470 and 11172397 - 11172524
merged into SMARCA4_35.28997_1.47562.SMARCA4_35.28997_1.98537: chr19:11172374-11172524
These two regions overlap by more than 75%: chr19 52716249 - 52716350 and 52716210 - 52716333
merged into SP_108.56083_1.148366.SP_108.56083_1.216597: chr19:52716210-52716350
These two regions overlap by more than 75%: chr19 4047877 - 4048018 and 4047910 - 4048049
merged into ZBTB7A_1.67873.ZBTB7A_1.81922: chr19:4047877-4048049
These two regions overlap by more than 75%: chr8 128752974 - 128753086 and 128753005 - 128753110
merged into MYC_3.179895_1.146699.MYC_3.179895_1.193715: chr8:128752974-128753110
These two regions overlap by more than 75%: chr1 11190805 - 11190931 and 11190789 - 11190899
merged into SP_8.22255_1.172662.SP_8.22255_1.60091: chr1:11190789-11190931
These two regions overlap by more than 75%: chr11 108216397 - 108216514 and 108216380 - 108216495
merged into ATM_4.34438_1.116103.ATM_4.34438_1.83519: chr11:108216380-108216514
These two regions overlap by more than 75%: chr17 56439828 - 56439952 and 56439850 - 56439975
merged into RNF43_5.17250_1.32609.RNF43_5.17250_1.80351: chr17:56439828-56439975
These two regions overlap by more than 75%: chr17 12011122 - 12011207 and 12011142 - 12011263
merged into SP_83.21899_1.100027.SP_83.21899_1.54849: chr17:12011122-12011263
These two regions overlap by more than 75%: chr16 89842097 - 89842208 and 89842117 - 89842235
merged into FANCA_23.53959_1.124713.FANCA_23.53959_1.8864: chr16:89842097-89842235
These two regions overlap by more than 75%: chr16 89807127 - 89807236 and 89807083 - 89807222
merged into FANCA_6.33590_1.121771.FANCA_6.33590_1.28197: chr16:89807083-89807236
These two regions overlap by more than 75%: chr3 37088968 - 37089093 and 37088982 - 37089106
merged into MLH1_16.37834_1.241717.MLH1_16.37834_1.77471: chr3:37088968-37089106
These two regions overlap by more than 75%: chr3 38181918 - 38182044 and 38181902 - 38182018
merged into SP_133.35672_1.195191.SP_133.35672_1.66065: chr3:38181902-38182044
These two regions overlap by more than 75%: chr15 66774080 - 66774199 and 66774093 - 66774211
merged into SP_75.51183_1.106366.SP_75.51183_1.212778: chr15:66774080-66774211
These two regions overlap by more than 75%: chrX 76872074 - 76872145 and 76872015 - 76872133
merged into ATRX_14.12742_1.29312.ATRX_14.12742_1.8946: chrX:76872015-76872145
These two regions overlap by more than 75%: chrX 76845225 - 76845342 and 76845238 - 76845356
merged into ATRX_9.52579_1.32513.ATRX_9.52579_1.37867: chrX:76845225-76845356
These two regions overlap by more than 75%: chrX 47426026 - 47426157 and 47426002 - 47426126
merged into SP_217.19864_1.57866.SP_217.19864_1.83905: chrX:47426002-47426157
These two regions overlap by more than 75%: chr4 84384658 - 84384779 and 84384638 - 84384758
merged into FAM175A_2.126304.FAM175A_2.23466: chr4:84384638-84384779
These two regions overlap by more than 75%: chr4 55972822 - 55972946 and 55972807 - 55972898
merged into SP_164.30020_1.130391.SP_164.30020_1.33497: chr4:55972807-55972946
These two regions overlap by more than 75%: chr9 135785898 - 135786025 and 135785878 - 135786003
merged into OCP1_TSC1_34_1.180877.OCP1_TSC1_34_1.74209: chr9:135785878-135786025
These two regions overlap by more than 75%: chr13 48953705 - 48953770 and 48953628 - 48953758
merged into CHP2_RB1_5_1.194495.OCP1_RB1_24_1.50011: chr13:48953628-48953770
These two regions overlap by more than 75%: chr10 89624190 - 89624321 and 89624219 - 89624354
merged into CHP2_PTEN_1_1.101892.CHP2_PTEN_1_1.138223: chr10:89624190-89624354
These two regions overlap by more than 75%: chr10 43615517 - 43615607 and 43615456 - 43615591
merged into CHP2_RET_4_1.138016.CHP2_RET_4_1.5615: chr10:43615456-43615607
These two regions overlap by more than 75%: chr5 67591990 - 67592115 and 67591973 - 67592088
merged into OCP1_PIK3R1_35_1.179919.OCP1_PIK3R1_35_1.186198: chr5:67591973-67592115

--Coordinates are read--

Total target length: 719043

Illegal division by zero error in ONCOCNV_getCounts.pl

I am using the wrapper shell script (ONCOCNV.sh) to run ONCOCNV (version 6.6) and am getting a division by zero error in line 462.
This issue is referred to in this thread (http://seqanswers.com/forums/showthread.php?t=50211) where it is indicated that this can be caused by a malformed bed file. I have tried using my own bed file (following the instructions for formatting) and also the file provided as part of the test dataset and get the same error in both cases.
Please could you provide some advice on how I can identify the source of this error? Many thanks.

Queries on applying ONCOCNV on Exome Seq Data

Hello Sir/Ma'am,

1] I am trying to apply ONCOCNV to exome data, i.e. without reference to amplicon IDs. What modifications are required in the bed file to apply it to exome data?

Standard bed file:
track name="4477685_CCP_Designed" description="Amplicon_Insert_4477685_CCP" type=bedDetail
chr1 2488068 2488201 AMPL242431688 0 TNFRSF14

Bed file used for Exome data:
track name="Covered" description="Illumina Exon - Genomic regions covered by probes" db=hg19
chr1 14694 14814 AMPL1354 0 WASH7P

The bed file used above contains a random amplicon ID in the 4th column and 0 in the 5th column. Is this the right strategy for exome-seq data?

2] When applying ONCOCNV to exome data, what precautions or prerequisites should be taken care of?

Kindly Guide.

Regards,
Vyomesh
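
One way to produce the 6-column layout described in the README from an exome target file is sketched below. It is an illustration only: add_region_ids is a hypothetical helper, the input is assumed to be a 4-column BED (chr, start, end, gene name), and the generated "AMPL<n>" IDs are arbitrary unique labels in the spirit of the example above. Whether ONCOCNV needs anything else for exome data is not stated here.

```shell
#!/usr/bin/env bash
# add_region_ids FILE: convert a 4-column BED (chr, start, end, gene) into
# chr, start, end, unique region ID, 0, gene (tab-separated).
add_region_ids() {
    awk 'BEGIN { OFS = "\t" }
        /^(track|browser|#)/ { next }   # drop header or comment lines
        NF >= 4 { n++; print $1, $2, $3, "AMPL" n, 0, $4 }
    ' "$1"
}
```

Usage: add_region_ids exome_targets.bed | sort -k1,1 -k2,2n > exome_targets.oncocnv.bed (both file names are placeholders). The sort step keeps coordinates ordered, and the README's other requirements (no duplicate coordinates, one locus per gene name) still apply to the result.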

understanding the output of the tool

I ran this tool on my amplicon sequencing samples, but I am having a hard time understanding the results.
In the results, the predLargeSeg value is not always equal to the predPoint value (copies). How do I interpret and filter the results? In the example below, the predLargeSeg value is 2 but predPoint says there are 3 copies.
One other thing: are these ratio values log10 or log2?

chr | start | end | gene | ID | ratio | predLargeSeg | segMean | pvalRatioGene | predPoint | comments
chr4 | 88926689 | 88926788 | PKD2 | PKD2.NA.chr4.88926689.88926788 | 0.336552509 | 2 | 0.086626899 | 5.51E-22 | q-value=8.5498138112145e-05, copies=3 | SegRatio=0.09,AbsMeanSigma=3.13,pvalue=1.32977929802886e-174,pvalueTTest=1.46096694800732e-12,
chr4 | 88927389 | 88927488 | PKD2 | PKD2.NA.chr4.88927389.88927488 | 0.407056826 | 2 | 0.086626899 | 5.51E-22 | q-value=3.98010118067732e-11, copies=3 | SegRatio=0.09,AbsMeanSigma=3.13,pvalue=1.32977929802886e-174,pvalueTTest=1.46096694800732e-12,

Thanks !

Usability on large sample sizes

Hi,

First of all, thank you for supporting ONCOCNV.
I'm currently trying to use ONCOCNV 6.9 on a larger sample size (~1,700 tumor-normal pairs) and
I think that this is a bit of a stretch since it took about a week to read in the data alone.

What I'm a bit confused about is that when running processControl.R
(from shell script provided) the script printed after about 10 minutes this:

"Warning: you have both male and female samples in the control. We will try to assign sex using read coverage on chrX". I'm a bit confused why this is a warning, since it's a described feature in the paper to determine sex automatically. In case this is not the intended behaviour, we already have the genders for each sample. Is it easy enough to provide a sex vector containing c(0.5, 1) to the script?

After the script is printing the warning mentioned above, it started to allocate 800% CPU and is running for 24 hours straight without printing anything else. I tried to go through the code, but I couldn't find anything that could cause this. Is fastICA() allocating multiple cores when run in C-mode without documentation?

One final question,
I tried to find the cause for the unexpected appetite and ran ONCOCNV on a smaller subset and I noticed that in line 95 (processControl.R) you set
NUMBEROFPC = ncont-1;

although you set it very explicitly to 3 a few lines earlier. fastICA() is then run with this variable -in my case, NUMBEROFPC=1704.
What is the rationale behind this, could this be the reason that ONCOCNV keeps running?

This is not exactly a bug, but it would help me a lot understanding ONCOCNV much better.
Thank you,
Robert

Error while running OnCOCNV

Hi, I am getting the following error while running it:

R.cache v0.12.0 (2015-11-12) successfully loaded. See ?R.cache for help.
Loading required package: lattice
Loading required package: grid
Loading required package: parallel
Error in t.test.default((geneSetout$ratio[tt] - fragRatio)/sdCorrection/geneSet$sd[tt], :
data are essentially constant
Calls: t.test -> t.test.default
In addition: Warning messages:
1: In `[<-.factor`(`*tmp*`, i, value = "NA") :
invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, i, value = "NA") :
invalid factor level, NA generated
Execution halted

Can you please help to resolve the issue?

Problem prepping stats from bam files ONCOCNV_getCounts.pl line 450

Hi,
I am using OncoCNV 6.6 to get copy number estimates on exome data.
There is an error that comes up when running the ONCOCNV_getCounts.pl

$perl $TOOLDIR/ONCOCNV_getCounts.pl getControlStats -m Exon -b $targetBed -c $controls -o $OUTDIR/Control.stats.txt

reading NA12891_Capture.MarkDuplicates.mdup..GATK.Recalibrate.bam
	sample name: NA12891_Capture.MarkDuplicates.mdup..GATK.Recalibrate

[ many lines counting reads ]
read 95700000 reads
	Total target length: 0
processed 2 controls, NA12878_Capture.MarkDuplicates.mdup..GATK.Recalibrate NA12891_Capture.MarkDuplicates.mdup..GATK.Recalibrate
Can't use an undefined value as an ARRAY reference at
	/opt/ONCOCNV-6.6/src//ONCOCNV_getCounts.pl line 450 (#1)
    (F) A value used as either a hard reference or a symbolic reference must
    be a defined value.  This helps to delurk some insidious errors.
    
Uncaught exception from user code:
	Can't use an undefined value as an ARRAY reference at /opt/ONCOCNV-6.6/src//ONCOCNV_getCounts.pl line 450.

the input files are :
bed file with the exome target regions

$head 1-target.bed 
chr1	12100	12258	AMPLENSG00000223972.4	0	ENSG00000223972.4
chr1	12555	12721	AMPLENSG00000223972.4.1	0	ENSG00000223972.4
chr1	13333	13701	AMPLENSG00000223972.4.2	0	ENSG00000223972.4
chr1	30336	30503	AMPLENSG00000243485.2	0	ENSG00000243485.2
chr1	35047	35544	AMPLENSG00000237613.2	0	ENSG00000237613.2
chr1	35620	35778	AMPLENSG00000237613.2.1	0	ENSG00000237613.2
chr1	69091	70008	AMPLENSG00000186092.4	0	ENSG00000186092.4
chr1	324296	324394	AMPLENSG00000237094.6	0	ENSG00000237094.6
chr1	324429	325605	AMPLENSG00000237094.6.1	0	ENSG00000237094.6
chr1	327736	328214	AMPLENSG00000237094.6.2	0	ENSG00000237094.6

and a bam file processed with Picard MarkDuplicates, GATK Realign and Recalibrate.
Before any processing, since the BAM file contains reads aligned to unplaced scaffolds which are not present in target.bed, I am filtering with samtools view -L <target.bed>

Any advice/help/suggestions are welcome
Thanks
Konstantinos

Error in if (length(which(observations == 0))/length(observations) > 0.3) { : missing value where TRUE/FALSE needed

Hi All, I am facing this missing value error. Please let me know if anyone is aware of it.

Running ONCOCNV gets different copy number prediction / p value each time

Dear authors,

Running the latest version of ONCOCNV V7.0 generates different output including copy number predictions and p values each time. I think setting a seed inside processSamples.R is needed. Would you update your package and make sure that each run generates consistent copy number prediction and p values?

Best,
Ruochen

Error in t.test.default((values)/sigma[indLargeSeg], mu = 0) : not enough 'x' observations Calls: t.test -> t.test.default In addition: There were 50 or more warnings (use warnings() to see the first 50) Execution halted

Dear authors,

This error occurs during multiple runs of the latest version (V7.0) of ONCOCNV:

Error in t.test.default((values)/sigma[indLargeSeg], mu = 0) :
not enough 'x' observations
Calls: t.test -> t.test.default
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Execution halted

I figured out that this is due to the following code in line 774:
tt <- which(predValeu==0 & ratio< -magicThreshold)

Sometimes, the output of the predictClust function (predValeu variable) contains no "0" class, only "-1", "1", and "NA". In this case, tt will have length 0 and cause the bug I mentioned. How would you advise redefining tt? I want to use "tt <- which(predValeu==1 & ratio< -magicThreshold)" instead so that the function bypasses this problem, but I am wondering if you have any suggestions. Would you update your package to address this bug?