Giter Site home page Giter Site logo

battenberg's Introduction

Battenberg

This repository contains code for the whole genome sequencing subclonal copy number caller Battenberg, as described in Nik-Zainal, Van Loo, Wedge, et al. (2012), Cell.

Installation instructions

The instructions below will install the latest stable Battenberg version.

Prerequisites

Installing from Github requires devtools and Battenberg requires the modified copynumber package from "igordot/copynumber" and readr, gtools, splines, ggplot2, gridExtra, RColorBrewer, VariantAnnotation and ASCAT. The pipeline requires parallel and doParallel. From the command line run:

R -q -e 'BiocManager::install(c("devtools", "splines", "readr", "doParallel", "ggplot2", "RColorBrewer", "gridExtra", "gtools", "parallel", "igordot/copynumber", "VariantAnnotation"))'
R -q -e 'devtools::install_github("Crick-CancerGenomics/ascat/ASCAT")'

Installation from Github

To install Battenberg, run the following from the command line:

R -q -e 'devtools::install_github("Wedge-Oxford/battenberg")'

Required reference files

GRCh37 reference files may downloaded from here: https://ora.ox.ac.uk/objects/uuid:2c1fec09-a504-49ab-9ce9-3f17bac531bc

The bundle contains the following files:

  • battenberg_1000genomesloci2012_v3.tar.gz
  • battenberg_impute_1000G_v3.tar.gz
  • probloci_270415.txt.gz
  • battenberg_wgs_gc_correction_1000g_v3.tar.gz
  • battenberg_wgs_replic_correction_1000g_v3.tar.gz
  • battenberg_snp6_exe.tgz (SNP6 only)
  • battenberg_snp6_ref.tgz (SNP6 only)

GRCh38 reference files may be downloaded from here: https://ora.ox.ac.uk/objects/uuid:08e24957-7e76-438a-bd38-66c48008cf52

The bundle contains the following files:

  • 1000G_loci_hg38.zip
  • imputation.zip
  • beagle5.zip
  • probloci.zip
  • GC_correction_hg38.zip
  • RT_correction_hg38.zip
  • README.txt

Pipeline

Go into inst/example for example WGS and SNP6 R-only pipelines.

Description of the output

Key output files

  • [samplename]_subclones.txt contains the copy number data (see table below)
  • [samplename]_rho_and_psi.txt contains the purity estimate (make sure to use the FRAC_genome, rho field in the second row, first column)
  • [samplename]_BattenbergProfile*png shows the profile (the two variants show subclonal copy number in a different way)
  • [samplename]_subclones_chr*.png show detailed figures of the copy number calls per chromosome
  • [samplename]_distance.png This shows the purity and ploidy solution space and can be used to pick alternative solutions

The copy number profile saved in the [samplename]_subclones.txt is a tab delimited file in text format. Within this file there is a line for each segment in the tumour genome. Each segment will have either one or two copy number states:

  • If there is one state that line represents the clonal copy number (i.e. all tumour cells have this state)
  • If there are two states that line represents subclonal copy number (i.e. there are two populations of cells, each with a different state)

A copy number state consists of a major and a minor allele and their frequencies, which together add give the total copy number for that segment and an estimate fraction of tumour cells that carry each allele.

The following columns are available in the Battenberg output:

Column Description
chr The chromosome of the segment
startpos Start position on the chromosome
endpos End position on the chromosome
BAF The B-allele frequency of the segment
pval P-value that is obtained when testing whether this segment should be represented by one or two states. A low p-value will result in the fitting of a second copy number state
LogR The log ratio of normalised tumour coverage versus its matched normal sequencing sample
ntot An internal total copy number value used to determine the priority of solutions. NOTE: This is not the total copy number of this segment!
nMaj1_A The major allele copy number of state 1 from solution A
nMin1_A The minor allele copy number of state 1 from solution A
frac1_A Fraction of tumour cells carrying state 1 in solution A
nMaj2_A The major allele copy number of state 2 from solution A. This value can be NA
nMin2_A The minor allele copy number of state 2 from solution A. This value can be NA
frac2_A Fraction of tumour cells carrying state 2 in solution A. This value can be NA
SDfrac_A Standard deviation on the BAF of SNPs in this segment, can be used as a measure of uncertainty
SDfrac_A_BS Bootstrapped standard deviation
frac1_A_0.025 Associated 95% confidence interval of the bootstrap measure of uncertainty

Followed by possible equivalent solutions B to F with the same columns as defined above for solution A (due to the way a profile is fit Battenberg can generate a series of equivalent solutions that are reported separately in the output).

Plots for QC

It also produces a number plots that show the raw data and are useful for QC (and their raw data files denoted by *.tab)

  • [samplename].tumour.png and [samplename].germline.png show the raw BAF and logR
  • [samplename]_coverage.png contains coverage divided by the mean coverage of both tumour and normal
  • [samplename]_alleleratio.png shows BAF*logR, a rough approximation of what the data looks like shortly before copy number calling

Intermediate figures

Finally, a range of plots show intermediate steps and can occasionally be useful

  • [samplename]_chr*_heterozygousData.png shows reconstructed haplotype blocks in the characteristic Battenberg cake pattern
  • [samplename]_RAFseg_chr*.png and [samplename]_segment_chr*.png contains segmentation data for step 1 and step 2 respectively
  • [samplename]_nonroundedprofile.png shows the copy number profile without rounding to integers
  • [samplename]_copynumberprofile.png shows the copy number profile with (including subclonal copy number) rounding to integers

Advice for including structural variant breakpoints

Battenberg can take prior breakpoints, from structural variants (SVs) for example, as input. SV breakpoints are typically much more precise and a pair of SVs can be closer together then what typically can be obtained from a BAF or coverage track. It is therefore adventageous to include prior breakpoints in a Battenberg run. However, including too many (as in 100s) incorrect breakpoints can have adverse effects by allowing many small segments to be affected by noise where there isn't any signal and increasing the runtime of the pipeline. It is therefore advised to filter prior breakpoints from SVs such that the genome is slightly oversegmented. Finally, some SV types, such as inversions, do not constitute a change in copy number and therefore also add breakpoints that should not be considered. It is therefore also advised to filter breakpoints from SVs that do not cause a change in copynumber, such as inversions.

Docker - experimental

Battenberg can be run inside a Docker container. Please follow the instructions below.

Installation

git clone [email protected]:Wedge-Oxford/battenberg.git
cd battenberg
docker build -t battenberg:2.2.9 .

Reference data

First, download the Battenberg reference data from the URL provided further in this README. Then in the impute_info.txt file, replace the paths to the reference files with /opt/battenberg_reference. I.e. the path to the first legend file should become:

/opt/battenberg_reference/1000genomes_2012_v3_impute/ALL_1000G_phase1integrated_v3_chr1_impute.legend

Run interactively

These commands run the Battenberg pipeline within a Docker container in interactive mode. This command assumes the data is available locally in $PWD/data/pcawg/HCC1143_ds and the reference files have been placed in $PWD/battenberg_reference

docker run -it -v `pwd`/data/pcawg/HCC1143_ds:/mnt/battenberg/ -v `pwd`/battenberg_reference:/opt/battenberg_reference battenberg:2.2.8

Within the Docker terminal run the pipeline, in this case on the ICGC PCAWG testing data available here.

R CMD BATCH '--no-restore-data --no-save --args -t HCC1143 -n HCC1143_BL --nb /mnt/battenberg/HCC1143_BL.bam --tb /mnt/battenberg/HCC1143.bam --sex female -o /mnt/battenberg/' /usr/local/bin/battenberg_wgs.R /mnt/battenberg/battenberg.Rout

Building a release

In RStudio: In the Build tab, click Check Package

Then open the NAMESPACE file and edit:

S3method(plot,haplotype.data)

to:

export(plot.haplotype.data)
hg38 for Beagle5

Modified original code to derive the input vcf for Beagle5 and hg38:

#!/bin/bash
#
# READ_ME file (08 Dec 2015)
# 
# 1000 Genomes Project Phase 3 data release (version 5a) in VCF format for use with Beagle version 4.x
#    Data Source: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/*
#
# NOTES:
# 
# 1) Markers with <5 copies of the reference allele or <5 copies of the non-reference alleles have been excluded.
#
# 2) Structural variants have been excluded.
#
# 3) All non-unique identifiers in the ID column are removed from the ID column
#
# 4) Additional marker filtering may be performed using the gtstats.jar and filterlines.jar utilities
#
# 5) Sample information is in files: 
#      integrated_call_samples.20130502.ALL.ped
#      integrated_call_samples_v3.20130502.ALL.panel
#
# 6) Male haploid chromosome X genotypes are encoded as diploid homozygous genotypes.
#
############################################################################
#  The following shell script was used to create the files in this folder  #
############################################################################
#

## required if loading modules
module load Java
module load HTSlib

## wget for GRCh37
## wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/*

# wget for GRCh38 (liftover from hg38)
# wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/*
# see article: https://wellcomeopenresearch.org/articles/4-50
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/ALL.chr*
wget http://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes_phase3_v5a/utilities/filterlines.jar
wget http://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes_phase3_v5a/utilities/gtstats.jar
wget http://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes_phase3_v5a/utilities/remove.ids.jar
wget http://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes_phase3_v5a/utilities/simplify-vcf.jar

## Downloadable from Beagle5 web page
gts="java -ea -jar gtstats.jar"
fl="java -ea -jar filterlines.jar"
rmids="java -ea -jar remove.ids.jar"
simplify="java -ea -jar simplify-vcf.jar"
min_minor="5"

## Running directory 
src="./"
#mkdir ${src}
#cd ${src}
#cd -

## Go through autosomes and prepare vcf
for chr in $(seq 4 5); do
echo "chr${chr}"
#input="${src}ALL.chr${chr}_GRCh38.genotypes.20170504.vcf.gz"
input="${src}ALL.chr${chr}.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz"
#vcf_removefield="${input}_removedfield.vcg.gz"
vcf="chr${chr}.1kg.phase3.v5a_GRCh38.vcf.gz"
excl="chr${chr}.1kg.phase3.v5a_GRCh38.excl"

#zcat ${input} | awk '/^[^#]/ {gsub(";GRC.*","",$9);print}' > ${vcf_removefield}
zcat ${input} | grep -v 'SVTYPE' | ${gts} | ${fl} \# -13 ${min_minor} 1000000 | cut -f2 > ${excl}
zcat ${input} | grep -v 'SVTYPE' | grep -v '^#' | cut -f3 | tr ';' '\n' | sort | uniq -d > chr${chr}.dup.id

# BEGIN: add 4 duplicate markers to exclusion list
#if [ ${chr} == "8" ]; then echo "88316919"; fi >> ${excl}
#if [ ${chr} == "12" ]; then echo ""; fi >> ${excl}
#if [ ${chr} == "14" ]; then echo "21181798"; fi  >> ${excl}
#if [ ${chr} == "17" ]; then echo "1241338"; fi  >> ${excl}
# END:  add 4 duplicate markers to exclusion list

zcat ${input} | grep -v 'SVTYPE' | ${fl} \# \-2 ${excl} | ${simplify} | ${rmids} chr${chr}.dup.id | bgzip -c > ${vcf}
tabix ${vcf}
done

## Same for chromosome X                                                                                                  
chr="X"
#in="${src}ALL.chr${chr}_GRCh38.genotypes.20170504.vcf.gz"
in="${src}ALL.chr${chr}.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz"
vcf="chr${chr}.1kg.phase3.v5a_GRCh38.vcf.gz"
excl="chr${chr}.1kg.phase3.v5a_GRCh38.excl"

### sed command recodes male chrom X genotypes as homozygote diploid genotypes
### sed command makes temporary change to floating point numbers and diploid genotypes to permit use of word-boundary '\>'
cat_in=" zcat ${in} | grep -v 'SVTYPE' | sed \
-e 's/\t0\t\.\t/\t111\tPASS\t/' \
-e 's/0\tPASS/222\tPASS/' \
-e 's/\([0-9]\)\./\1WXYZ/g' \
-e 's/\([0-9]\)|\([0-9]\)/\1X\2/g' \
-e 's/\t\([0-9]\)\>/\t\1|\1/g' \
-e 's/\([0-9]\)WXYZ/\1./g' \
-e 's/\([0-9]\)X\([0-9]\)/\1|\2/g';"

echo ${chr}
  eval ${cat_in} | grep -v '^#' | cut -f3 | tr ';' '\n' | sort | uniq -d > chr${chr}.dup.id
  eval ${cat_in} | ${gts} | ${fl} '\#' -13 ${min_minor} 1000000 | cut -f2 > ${excl}

  # BEGIN: add duplicate markers to exclusion list
  #if [ ${chr} == "X" ]; then echo "5457254"; fi >> ${excl}
  #if [ ${chr} == "X" ]; then echo "32344545"; fi >> ${excl}
  #if [ ${chr} == "X" ]; then echo "68984437"; fi >> ${excl}
  # END:  add duplicate markers to exclusion list

eval ${cat_in} | ${fl} \# \-2 ${excl} | ${simplify} | ${rmids} chr${chr}.dup.id | bgzip -c > ${vcf}
tabix ${vcf}

Run R code to generate loci, allele and gc_content files:

##########################################################################
## set working directory to where the vcf files are located
setwd("./")
##########################################################################
ffs <- dir(full=T,pattern="1kg.phase3.v5a_GRCh38.vcf.gz$")
MC.CORES <- 10 ## set number of cores to use
##########################################################################


##########################################################################
library(parallel)
##########################################################################
## Further remove unreferenced alleles and duplicates
##########################################################################
mclapply(ffs[length(ffs):1],function(x)
{
    cat(paste(x,"read file"))
    out  <- gsub(".vcf.gz","nounref.vcf.gz",x)
    cmd <- paste0("zcat ",
                  x,
                  " | grep -v '",
                  "\\",
                  ".",
                  "|","' ",
                  " | grep -v '",
                  "|",
                  "\\",
                  ".",
                  "' "
                 ,"|"," awk '!seen[$2]++' |  gzip > ",
                  out)
    system(cmd)
},mc.cores=MC.CORES)
##########################################################################

ffs <- dir(full=T,pattern="1kg.phase3.v5a_GRCh38nounref.vcf.gz$")
## Generate Loci files
##########################################################################
mclapply(ffs[length(ffs):1],function(x)
{
    cat(paste(x,"read file"))
    out  <- gsub(".vcf.gz","_loci.txt",x)
    cmd <- paste0("zcat ",
                  x," | grep -v '^#' | awk -v OFS='\t' '{print $1, $2}' > ",
                  out)
    system(cmd)
},mc.cores=MC.CORES)
##########################################################################
## with chr string (for BAMs that contain "chr")
mclapply(ffs[length(ffs):1],function(x)
{
    ##cat(paste(x,"read file"))
    out  <- gsub(".vcf.gz","_loci_chrstring.txt",x)
    cmd <- paste0("zcat ",
                  x," | grep -v '^#' | awk -v OFS='\t' '{print ",
                  '"chr"'
                 ,"$1, $2}' > ",
                  out)
    system(noquote(cmd))
},mc.cores=MC.CORES)
##########################################################################
## Generate Allele Files
##########################################################################
mclapply(ffs[length(ffs):1],function(x)
{
    cat(paste(x,"read file"))
    out  <- gsub(".vcf.gz","_allele_letter.txt",x)
    cmd <- paste0("zcat ",
                  x," | grep -v '^#' | awk -v OFS='\t' '{print $2, $4, $5}' > ",
                  out)
    system(cmd)
},mc.cores=MC.CORES)
##########################################################################
## Convert Alleles into Indices for BB input
indices <- c("A"=1,"C"=2,"G"=3,"T"=4)
##########################################################################
mclapply(ffs[length(ffs):1],function(x)
{
    cat(".")
    inp <- gsub(".vcf.gz","_allele_letter.txt",x)
    out <- gsub(".vcf.gz","_allele_index.txt",x)
    df <- as.data.frame(data.table::fread(inp))
    ref <- indices[df[,2]]
    alt <- indices[df[,3]]
    ndf <- data.frame(position=df[,1],
                      a0=ref,
                      a1=alt)
    write.table(ndf,file=out,
                row.names=F,col.names=T,sep="\t",quote=F)
},mc.cores=5)
##########################################################################


##########################################################################
## Symlink loci to change the names allowing for a "prefix" before chr in BB
##########################################################################
allfs <- dir(full=T)
allfs_loci <- allfs[grepl("loci.txt$",allfs)]
tnull <- lapply(allfs_loci,function(x)
{
    cmd <- paste0("ln -s ",x," ",gsub("chr(.*?)\\.(.*).txt","\\2_chr\\1.txt",x))
    system(cmd)
    if(grepl("chrX",x))
    {
        cmd <- paste0("ln -s ",x," ",gsub("chr(.*?)\\.(.*).txt","\\2_chr23.txt",x))
        system(cmd)
    }
})
##########################################################################
allfs <- dir(full=T)
allfs_loci <- allfs[grepl("loci_chrstring.txt$",allfs)]
tnull <- lapply(allfs_loci,function(x)
{
    cmd <- paste0("ln -s ",x," ",gsub("chr(.*?)\\.(.*).txt","\\2_chr\\1.txt",x))
    system(cmd)
    if(grepl("chrX",x))
    {
        cmd <- paste0("ln -s ",x," ",gsub("chr(.*?)\\.(.*).txt","\\2_chr23.txt",x))
        system(cmd)
    }
})
##########################################################################
## Symlink alleles: same as for loci, symlink to change name for the
## use of prefixes
##########################################################################
allfs <- dir(full=T)
allfs_index <- allfs[grepl("allele_index",allfs)]
tnull <- lapply(allfs_index,function(x)
{
    cmd <- paste0("ln -s ",x," ",gsub("chr(.*?)\\.(.*).txt","\\2_chr\\1.txt",x))
    system(cmd)
    if(grepl("chrX",x))
    {
        cmd <- paste0("ln -s ",x," ",gsub("chr(.*?)\\.(.*).txt","\\2_chr23.txt",x))
        system(cmd)
    }
})
       
##########################################################################
##  Derive GC content files
##########################################################################
library(Rsamtools)
library(data.table)
library(Biostrings)
##########################################################################
gcTrack <- function(chr,
                    pos,
                    dna,
                    window=5000)
{
    gc <- rowSums(letterFrequencyInSlidingView(dna[[chr]],
                                               window,
                                               c("G","C")))/window
    gc[pos]
}
getRefGenome <- function (fasta = FASTA, CHRS = paste0("", c(1:22, "X", "Y", 
    "MT"))) 
{
    dna <- Biostrings::readDNAStringSet(fasta, format = "fasta")
    dna <- lapply(1:length(CHRS), function(x) dna[[x]])
    names(dna) <- CHRS
    return(dna)
}
##########################################################################
FASTA <- "genome.fa" ## Link to genome reference fasta file 
CHRS <- paste0("", c(1:22, "X"))
BEAGLELOCI.template <- "chrCHROMNAME.1kg.phase3.v5a_GRCh38nounref_loci.txt"
##########################################################################
REFGENOME <- getRefGenome(fasta = FASTA, CHRS = CHRS) ## Loads genome
in memory
##########################################################################
OUTDIR <- "1000genomes_2012_v3_gcContent_hg38"
system(paste0("mkdir ",OUTDIR))
##########################################################################

##########################################################################
windows <- c(25,
             50,
             100,
             200,
             500,
             1000,
             2000,
             5000,
             10000,
             20000,
             50000,
             100000,
             200000,
             500000,
             1000000,
             2000000,
             5000000,
             10000000)
names(windows) <- formatC(windows,format="f",digits=0)
names(windows) <- gsub("000000$","Mb",names(windows))
names(windows) <- gsub("000$","kb",names(windows))
names(windows) <- sapply(names(windows),function(x) if(grepl("[0-9]$",x)) paste0(x,"bp") else x)
##########################################################################

writeGC <- function(gccontent,chr,outdir)
{
    write.table(gccontent,
                file=gzfile(paste0(outdir,"/1000_genomes_GC_corr_chr_",chr,".txt.gz")),
                col.names=T,
                row.names=T,quote=F,sep="\t")
}

mclapply(CHRS,function(chr)
{
	cat(chr)
	snppos <- as.data.frame(data.table::fread(gsub("CHROMNAME",chr,BEAGLELOCI.template)))
    gccontent <- sapply(windows,function(window) gcTrack(chr=chr,
                                                         pos=snppos[,2],
                                                         dna=REFGENOME,
                                                         window=window))*100
    gccontent <- cbind(rep(chr,nrow(gccontent)),snppos[,2],gccontent)
    colnames(gccontent)[1:2] <- c("chr","Position")
    rownames(gccontent) <- snppos[,2]
    writeGC(gccontent,chr,OUTDIR)
},mc.cores=10)
   
Example run

To run using Beagle5, simply parametrise the same way you would run under impute2. It should be back compatible, so you can run impute2 by setting usebeagle=FALSE. And it uses the same input files needed for the pipeline, i.e. 1000G loci/alleles + ref panel + prob loci + imputeinfo file etc.

The map plink files for Beagle can be downloaded from: http://bochet.gcc.biostat.washington.edu/beagle/genetic_maps/

BEAGLEJAR <- "$PATHTOBEAGLEFILES/beagle.24Aug19.3e8.jar"
BEAGLEREF.template <- "$PATHTOBEAGLEFILES/chrCHROMNAME.1kg.phase3.v5a.b37.bref3"
BEAGLEPLINK.template <- "$PATHTOBEAGLEFILES/plink.chrCHROMNAME.GRCh37.map"

timed <- system.time(battenberg(tumourname=TUMOURNAME,
                                normalname=NORMALNAME,
                                tumour_data_file=TUMOURBAM,
                                normal_data_file=NORMALBAM,
                                imputeinfofile=IMPUTEINFOFILE,
                                g1000prefix=G1000PREFIX,
                                problemloci=PROBLEMLOCI,
                                gccorrectprefix=GCCORRECTPREFIX,
                                repliccorrectprefix=REPLICCORRECTPREFIX,
                                g1000allelesprefix=G1000PREFIX_AC,
                                ismale=IS_MALE,
                                data_type="wgs",
                                impute_exe="impute2",
                                allelecounter_exe="alleleCounter",
                                nthreads=NTHREADS,
                                platform_gamma=1,
                                phasing_gamma=1,
                                segmentation_gamma=10,
                                segmentation_kmin=3,
                                phasing_kmin=1,
                                clonality_dist_metric=0,
                                ascat_dist_metric=1,
                                min_ploidy=1.6,
                                max_ploidy=4.8, min_rho=0.1,
                                min_goodness=0.63,
                                uninformative_BAF_threshold=0.51,
                                min_normal_depth=10,
                                min_base_qual=20,
                                min_map_qual=35,
                                calc_seg_baf_option=1,
                                skip_allele_counting=F,
                                skip_preprocessing=F,
                                skip_phasing=F,
                                usebeagle=USEBEAGLE, ##set to TRUE to use beagle
                                beaglejar=BEAGLEJAR, ##path
                                beagleref=BEAGLEREF.template, ##pathtemplate
                                beagleplink=BEAGLEPLINK.template, ##pathtemplate
                                beaglemaxmem=15, 
                                beaglenthreads=1,
                                beaglewindow=40,
                                beagleoverlap=4,
                                snp6_reference_info_file=NA,
                                apt.probeset.genotype.exe="apt-probeset-genotype",
                                apt.probeset.summarize.exe="apt-probeset-summarize",
                                norm.geno.clust.exe="normalize_affy_geno_cluster.pl",
                                birdseed_report_file="birdseed.report.txt",
                                heterozygousFilter="none",
                                prior_breakpoints_file=NULL))

battenberg's People

Contributors

clemencyjolly avatar davidwedge avatar galder-max avatar haasek avatar jcesar101 avatar jdemeul avatar kzkedzierska avatar nansari-pour avatar rulixxx avatar sdentro avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

battenberg's Issues

Empty reference file shapeit2 chrX_nonPAR

Hi there,

The reference file ALL.v1a.shapeit2_integrated_chrX_nonPAR.GRCh38.20181129.phased.hap.gz provided here appears to be empty.

This caused an error in impute2, which did not get picked up by Battenberg:

ERROR: There are no type 2 SNPs after applying the command-line settings for this run, which makes it impossible to perform imputation. One possible reason is that you have specified an analysis interval (-int) that contains reference panel SNPs but not inference panel SNPs -- e.g., this can happen at the ends of chromosomes. Another possibility is that your genotypes and the reference panel are mapped to different genome builds, which can lead the same SNPs to be assigned different positions in different panels. If you need help fixing this error, please contact the authors.

and so the CN profiles generated had a large chunk of X with no breakpoints across a fairly large cohort of samples.

I assume this file should not be empty, so would it be possible to provide the correct version?

Thanks so much!

Incorrect variable name in prepare_wgs.R

Hi,

Thanks for the great package.
I was keep having the error saying "subscript out of bounds" while running prepare_wgs -> getBAFsAndLogRs.
This is the stacktrace:

Error in as.matrix(x)[i] : subscript out of bounds
In addition: There were 11 warnings (use warnings() to see them)

Enter a frame number, or 0 to exit

1: source("battenberg_wgs_debug.R")
2: withVisible(eval(ei, envir))
3: eval(ei, envir)
4: eval(ei, envir)
5: battenberg_wgs_debug.R#79: battenberg(tumourname = TUMOURNAME, normalname =
6: prepare_wgs(chrom_names = chrom_names, tumourbam = tumour_data_file, normal
7: getBAFsAndLogRs(tumourAlleleCountsFile.prefix = paste(tumourname, "_alleleF
8: mutant_data[cbind(1:len, allele_data[, 3])]
9: `[.data.frame`(mutant_data, cbind(1:len, allele_data[, 3]))

Debugging it showed me that in line 62 of prepare_wgs.R had a wrong variable:
normal_input_data = normal_input_data[chrpos_tumour %in% matched_data,]
Here, it seems that chrpos_tumour needs to be chrpos_normal for it to be match in dimensions.
If it is not written as chrpos_tumour on purpose, please fix the error. Thanks!

Error in if (nMinor < 0) { : missing value where TRUE/FALSE needed Calls: callSubclones -> determine_copynumber

Hi,

This is very rare but a few samples had this issue below when calling callSubclones().

Error in if (nMinor < 0) { : missing value where TRUE/FALSE needed
Calls: callSubclones -> determine_copynumber

This may be due to ploidy NA (FRAC_GENOME) in sample_rho_and_psi.txt?

rho psi ploidy distance is.best
ASCAT 1 2.15 2.10143838105564 NA NA
FRAC_GENOME 0.95 NA 2.09906082321258 0 FALSE
REF_SEG 1 2 2 1 TRUE

Thanks,

Taka

difference between sublones.txt and subclones_1.txt

Hi,
when running the call_subclones command of the Battenberg pipe line two sublcones files are generated.

subclones_1.txt (generated first and usually with more CN segments
sublones.txt

this appears to be independent of the number of copy number solutions for that sample.
What is the difference?

many thanks

Jamie

dev branch fails to install

FYI installation fails on the latest dev commit:

Error in parse(outFile) :                                                                                                                                                 
  /tmp/Rtmpu8a736/R.INSTALLab700ac38d/Battenberg/R/battenberg.R:307:72: unexpected ','                                                                                    
306:       # rename the original files without multisample phasing info                                                                                                   
307:       MutBAFfiles <- paste0(tumourname[sampleidx], "_chr", chrom_names),

'sv_breakpoints_file = NULL' option in callSubclones()

Hey guys!

sv_breakpoints_file = NULL or sv_breakpoints_file = NA throws an error as neither NULL == "NA" or NA == "NA" returns a TRUE/FALSE statement in R for the if statement if (!is.null(sv_breakpoints_file) & !sv_breakpoints_file == "NA").

Hope you are all well!

George

Run refit using refit_suggestion.txt

Hello,

I was wondering how to run refit using the refit suggestions generated in the initial run?

Also what is the difference between the *subclones_1.txt and *subclones.txt outputs.

Many Thanks,
Rashesh

Rework workflow to remove unneeded data reads

The workflow could be changed to remove the need to read in data at various steps.

The workflow could exist of essentially 4 main blocks:

  • preparation
  • haplotyping (everything that can be run in parallel)
  • segmentation
  • fitting copy number

These can be implemented in 4 separate R scripts that can be called by cgpBattenberg

Changing max_rho from 1 to 0.9 will break ascat.plotSunrise()

Hi,

As sometimes purity seems to be overestimated, I tried max_rho = 0.9 but it caused the issue below.

Error in axis(2, at = seq(0, 1/purity_max, by = 1/3/purity_max), labels = seq(purity_min, :
'at' and 'labels' lengths differ, 4 != 3
Calls: fit.copy.number -> runASCAT -> -> axis

I'm currently testing min_rho = 0, which may fix the issue but I think the axis labels should handle different min max rho values...?

Thanks,

Taka

Error in callChrXsubclones function

Hello,

I'm running Battenberg for hg19 with a case that doesn't seem to contain any copy number alteration. However, I'm not be able to complete the analysis because of the following error: Error in { : task 1 failed - "missing value where TRUE/FALSE needed" .

I tracked the error to the callChrXsubclones function. I'll try to explain the reason and the two possible solutions I've thought of, but I don't know if they are correct or maybe something else needs to be changed:

  • In my case, nrow(BBloh) = 1 in line 62
  • So, it enters in the if statement of line 64, but BBloh remains 1
  • In line 77 seg$type = "loss" according to seg$mean and in line 84 seg$CNA = "yes"

Up to this point, the declaration of the variables seems to be correct.

  • Because of seg$CNA = "yes", it enters in the if statement of line 87, where the problem appears in line 126. The code of this part is:

else if (seg$type == "loss") {
seg$CN = 0
if (nrow(BBloh) > 0) {
seg$clonal = ifelse(round(abs(explogrLoss - seg$mean), digits = 2) < round(abs(sd(BBloh$LogR)/explogrLoss), digits = 2), "yes", "no")
}
else if (nrow(BBloh) == 0) {
seg$clonal = ifelse(round(abs(explogrLoss - seg$mean), digits = 2) < round(abs(BBsd_max/explogrLoss), digits = 2), "yes", "no")
}
}

  • As nrow(BBloh) = 1, it enters in the first if, but when the function tries to calculate the sd(BBloh$LogR), it returns seg$clonal = NA because there is only one number in the variable.
  • Because this reason, in line 173 if (seg$clonal == "no"), the above error appears and the program finishes incomplete.

The solutions I propose are:

  1. In line 126: Change if (nrow(BBloh) > 0) by if (nrow(BBloh) > 1) and in line 131 else if (nrow(BBloh) == 0)by else if (nrow(BBloh) <= 1). So you don't have to calculate the sd of just one number.
  2. Or change lines 80 and 84 so if (nrow(BBloh) == 1){seg$CNA == "no"}. Thus, it enters in the else statement of line 138 without errors.

My BBloh looks like:

   chr startpos   endpos       BAF      pval        LogR     ntot
10   9 44891284 44900981 0.7768091 0.5352291 -0.05235313 1.919503
   nMaj1_A nMin1_A frac1_A nMaj2_A nMin2_A frac2_A SDfrac_A
10       1       0       1      NA      NA      NA       NA
   SDfrac_A_BS frac1_A_0.025 frac1_A_0.975
10          NA            NA            NA

And SAMPLEsegs:

   sampleID chrom arm start.pos   end.pos n.probes    mean
1 2102635TD     X   p   2600150  58563072   375739 -0.1034
2 2102635TD     X   q  61690452 155255094   636530 -0.1074

It is very likely that the error occurred because the case doesn't have a CNA, however I believe that this situation should be contemplated (or at least given a proper error).

Set seed and save to disk

Determine a seed and use that when randomising the alleles and when running impute. Save it to a file such that a rerun of BB will yield the exact same results. This negates the necessity to keep all the output.

Warning: invalid package '/opt/battenberg'

Hello,

I am getting the below error while building the docker image.
Jorge

Step 13/20 : RUN R -q -e 'install.packages("/opt/battenberg", repos=NULL, type="source")'
---> Running in 6d840fc92d87

install.packages("/opt/battenberg", repos=NULL, type="source")
Installing package into '/usr/local/lib/R/site-library'
(as 'lib' is unspecified)
Warning: invalid package '/opt/battenberg'
Error: ERROR: no packages specified
Warning message:
In install.packages("/opt/battenberg", repos = NULL, type = "source") :
installation of package '/opt/battenberg' had non-zero exit status

Subclones output file

Hi,
A scientist in our group has found a case where the cellular fraction (frac2_A) of the nMaj2_A and nMin2_A columns has a higher value than frac1_A from the callSubclones subroutine (version 2.2.5). We would be very grateful if you could explain why this is the case.

For example:

zcat subclones.txt.gz | awk 'BEGIN{ FS=OFS="\t" }{ print $1,$2,$3,$8,$9,$10,$11,$12,$13 }' | column -t | less -S
chr startpos endpos nMaj1_A nMin1_A frac1_A nMaj2_A nMin2_A frac2_A
...
12 19462666 19698320 3 0 0.381011741894733 3 1 0.618988258105267
...
12 37866866 83372701 2 1 0.431746072237525 3 1 0.568253927762475
...
12 83501125 133836024 2 0 0.332939966391955 2 1 0.667060033608045
...

Are these fractions supposed to be listed with major fraction on the left and minor fraction on the right, which is not what we are seeing here?
Regards
Kathryn

Bug in chr names in parse.imputeinfofile

Hi,
There is a bug in the below lines causing hg38 (contigs with chr prefix) to fail.

battenberg/R/impute.R

Lines 68 to 71 in 604a5d2

chr_names=unique(impute.info$chrom)
# Subset for a particular chromosome
if (!is.na(chrom)) {
impute.info = impute.info[impute.info$chrom==chr_names[chrom],]

Specifically, this line: chr_names = unique(impute.info$chrom)
makes a list of unique chromosomes names and stores it in chr_names.

But this line accesses chr_names[chrom] using the chromosome name, but chr_names is not a named vector.
For example:

> chr_names
 [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9" 
[10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
[19] "chr19" "chr20" "chr21" "chr22" "chrX" 
> chrom
[1] "chr10"
> chr_names[chrom]
[1] NA

Is there a version of Battenberg that can handle the "chr" prefixed contig names, specifically for hg38?

Thanks.

Rewrite fitcopynumber to not need full data

fitcopynumber could possibly run from just the segmented data and it's worth exploring whether this is possible. It would mean the raw data doesn't have to be read in and is not needed after impute in general.

Error in write_battenberg_phasing function

Hello,

While attempting to run hg38 aligned BAMs through Battenberg (hg38_ChrNotation_fix_NAP branch) using the hg38 reference data. I encountered an error that stems from the write_battenberg_phasing function which halted execution. The log file which I've attached dictates the following:


Error in fix.by(by.y, y) : 'by' must specify a uniquely valid column
Calls: battenberg ... write_battenberg_phasing -> merge -> merge.data.frame -> fix.by


To try and troubleshoot the problem, I worked through the possible lines that could be the cause.
The bafsegmented_file as well as all SNPfiles and imputedHaplotypeFiles exist and are appear to be populated correctly.
And what makes things more confusing is that the output VCF files from this function all seem to have been produced (filename convention - [tumorname]Battenberg_phased[chrom].vcf) and are populated.

battenberg_output.log

Docker build error & No permissions to output from docker...

  1. I was not able to build the Battenberg 2.2.9 docker image sucessfully. What worked was pulling files from the github: OpenGenomics/Battenberg. After this, I was able to get the docker to run and enter the terminal.

  2. When I run:

R CMD BATCH '--no-restore-data --no-save --args -t P2055N -n P1907T --nb /mnt/battenberg/P1907T.final.bam --tb /mnt/battenberg/P2055N.final.bam --sex female -o /mnt/battenberg/' /usr/local/bin/battenberg_wgs.R /mnt/battenberg/battenberg.Rout

I get the output:

'/usr/lib/R/bin/BATCH: 60: cannot create /mnt/battenberg/battenberg.Rout: Permission denied'

Does anyone have a solution to get permission to get permission to write the output?

Can I skip some code to get only 2 output files?

Hello,

I am running Battenberg to get these two outputs:
[samplename]_subclones.txt, [samplename]_rho_and_psi.txt

Is there a way to skip some of the code to speed things up and get only these outputs?
Currently one run takes 5 hours or more.

Thanks,
Naama

error when producing sunRise plot

Hi,
We have a case where no copy number solutions have been found for a data set, with the appropriate message written to the cnaStatusFile. However, we then get an error when trying to produce the sunrise plot:
Error in if (psi_opt1 > 0 && rho_opt1 > 0) { : missing value where TRUE/FALSE needed
Calls: fit.copy.number -> runASCAT -> <Anonymous>
In addition: Warning message:
In log(d) : NaNs produced

Should the code exit if no copy number solutions have been found and not try to create a sunrise plot? If it is possible to produce a valid sunrise plot in this case, would it possible for you to suggest a cause for the error we are seeing?
Regards
Kathryn

Can Battenberg be used with ASCATngs output?

Hi,

Is it possible to run the battenberg pipeline over ASCAT or ASCATngs output? For example I have the allele counts for both the tumor and the normal sample from ASCATngs, could I just feed it in to the battenberg function and run the pipeline from there?

I tried doing so, with the code below but I get an error message telling me that the unused argument (normal_name = "MCF10A"). When I remove the normal_name parameter, it fails and gives this error Error in prepare_wgs(chrom_names = chrom_names, tumourbam = tumour_data_file, : argument "normalname" is missing, with no default. I tried feeding the battenberg function analysis = "cell_line", but I get the error unused argument (analysis= "cell_line").

Any help would be greatly appreciated, thanks!

Code to try and run battenberg pipeline:
battenberg(tumourname= 'H_LS-BH-A208-01A-11D-A159-09', tumour_data_file = '~/tmpAscat/ascat/H_LS-BH-A208-01A-11D-A159-09_alleleCounts.tab', normal_name ='MCF10A', normal_data_file = '~/tmpAscat/ascat/MCF10A_0_EDHG200003860-1a_H2L5YDSXY_L3.count' , ismale=FALSE, imputeinfofile='~/battenberg/reference/hg38/imputation/impute_info.txt', g1000prefix='~/battenberg/reference/hg38/1000G_loci/1kg.phase3.v5a_GRCh38nounref_loci_chr', g1000allelesprefix='~/battenberg/reference/hg38/1000G_loci/1kg.phase3.v5a_GRCh38nounref_allele_index_chr', gccorrectprefix='~/battenberg/reference/hg38/GC_correction/1000G_GC_chr', repliccorrectprefix='~/battenberg/reference/hg38/replicationtimings/1000G_RT_chr', problemloci='~/battenberg/reference/hg38/problem_loci/probloci.txt.gz', data_type='WGS', skip_allele_counting=T, skip_preprocessing=F, skip_phasing=F, prior_breakpoints_file=NULL)

Warmest regards,
Hannan

Issues with hg38 aligned BAMs and new hg38 reference data

Hello,

With the recent release of the hg38 reference data, I have been working to run Battenberg with my hg38 aligned BAM files but have run into multiple issues so I wanted to make others aware.

First, I recommend those using looking to use the hg38 reference data to download them from the Dropbox site https://www.dropbox.com/sh/bize1n830t0mgzb/AACuzQiHbjQGqIuTJC3Xahzda?dl=0 instead of the ORA site as there seemed to be an issue with completing the download of the beagle5.zip. Also, the ORA site is completely missing the shapeit2 files that are used in the impute_info.txt file. Lastly for this point, the README.txt for the reference data is incomplete (there needs to be symlinks for the chrX files in the 1000G_loci_hg38 directory)

Second, the current version of the master branch (which is the default branch for those cloning the repo) does not support the new hg38 reference data. The hg38_ChrNotation_fix_NAP branch has implemented a fix for working with the hg38 reference data that will allow you to get past the first step (the prepare_wgs function call in the battenberg.R script) correctly.

Third, when using the hg38_ChrNotation_fix_NAP branch, the battenberg_wgs.R script requires extensive rewriting of default variables for compatibility with the hg38 reference data. The "..../" is used here as a placeholder for some user defined parent directory. The second half of the variables are specific for the use of Beagle5. Important notes for this part include: running Beagle5 requires Java 8, the JAR executable file must be downloaded https://faculty.washington.edu/browning/beagle/b5_0.html, all the reference files downloaded from the beagle5.zip need to moved into a new subdirectory that matches the GENOME_VERSION string (i.e. ..../beagle5/b38), and the JAR file must be in the ..../beagle5 directory.


IMPUTEINFOFILE = "..../imputation/impute_info.txt"
G1000PREFIX = "..../1000G_loci_hg38/1kg.phase3.v5a_GRCh38nounref_allele_index_chr"
G1000PREFIX_AC = "..../1000G_loci_hg38/1kg.phase3.v5a_GRCh38nounref_loci_chrstring_chr"
GCCORRECTPREFIX = "..../GC_correction_hg38/1000G_GC_chr"
REPLICCORRECTPREFIX = "..../RT_correction_hg38/1000G_RT_chr"
PROBLEMLOCI = "..../probloci.txt.gz"

GENOME_VERSION = "b38"
BEAGLE_BASEDIR = "..../beagle5"
BEAGLEJAR = file.path(BEAGLE_BASEDIR, "beagle.12Jul19.0df.jar")
BEAGLEREF.template = file.path(BEAGLE_BASEDIR, GENOME_VERSION, "chrCHROMNAME.1kg.phase3.v5a_GRCh38nounref.vcf.gz")
BEAGLEPLINK.template = file.path(BEAGLE_BASEDIR, GENOME_VERSION, "plink.chrCHROMNAME.GRCh38.map")


Fourth, there is a hardcoded variable in the battenberg function within the battenberg.R called "GENOMEBUILD" which is set to "hg19". This must be edited in place to "hg38".

Finally, even after all this, I have still been unsuccessful in having the pipeline execute 100% properly. I will detail the current issue in a new thread. I am grateful that the developers have produced the hg38 reference files, I just wish there also would have been a warning that they aren't fully ready for use in the current version of the software.

Instruction about the result

Hello:

I just some results files from Batternberg algorithm. Is there any instruction or readme file the interpret the result? here is all the files, do you think the program is completed or not? Thanks

-rw-r--r-- 1 zhangt8 zhangt8 201M Jul 18 01:02 GT01062_2017_allelecounts.tar.gz
-rw-r--r-- 1 zhangt8 zhangt8 212M Jul 18 01:03 GT01062_2018_allelecounts.tar.gz
-rw-r--r-- 1 zhangt8 zhangt8 9.8M Jul 18 01:03 GT01062_2018_hetbaf.tar.gz
-rw-r--r-- 1 zhangt8 zhangt8 9.5M Jul 18 01:03 GT01062_2018_hetdata.tar.gz
-rw-r--r-- 1 zhangt8 zhangt8 369M Jul 18 01:05 GT01062_2018_impute_input.tar.gz
-rw-r--r-- 1 zhangt8 zhangt8 362M Jul 18 01:06 GT01062_2018_impute_output.tar.gz
-rw-r--r-- 1 zhangt8 zhangt8 18M Jul 18 01:07 GT01062_2018_logR_Baf_segmented.vcf.gz
-rw-r--r-- 1 zhangt8 zhangt8 1.2M Jul 18 01:07 GT01062_2018_logR_Baf_segmented.vcf.gz.tbi
-rw-r--r-- 1 zhangt8 zhangt8 3.0M Jul 18 01:03 GT01062_2018_other.tar.gz
-rw-r--r-- 1 zhangt8 zhangt8 3.3M Jul 18 01:03 GT01062_2018_rafseg.tar.gz
drwxr-xr-x 2 zhangt8 zhangt8 4.0K Jul 18 01:03 GT01062_2018_subclones/
-rw-r--r-- 1 zhangt8 zhangt8 4.4M Jul 18 01:03 GT01062_2018_subclones.tar.gz
drwxr-xr-x 14 zhangt8 zhangt8 512K Jul 18 01:07 tmpBattenberg/

Also, I got an error information during running:
Use of uninitialized value in concatenation (.) or string at /opt/wtsi-cgp/lib/perl5/Sanger/CGP/Battenberg/Implement.pm line 1, do you know what's issue here?

Thanks.

plot error at plothaplotypes.23.

Hi,

Some samples had this error below at plothaplotypes.23. It seems that chrX caused some issues. Have you guys seen this error before?

Calls: plot.haplotype.data ... create.haplotype.plot -> plot -> plot.default -> localWindow -> plot.window
In addition: Warning messages:
1: In min(mut_data$Position, na.rm = T) :
no non-missing arguments to min; returning Inf
2: In max(mut_data$Position, na.rm = T) :
no non-missing arguments to max; returning -Inf
3: In min(x) : no non-missing arguments to min; returning Inf
4: In max(x) : no non-missing arguments to max; returning -Inf
Execution halted

Thanks,

Taka

allelecounter.exe

Hello. Where and how can one get access to the allelecounter.exe that is required for this to run?
Apologies if I'm missing something really obvious but I've been searching for a while and can't find it. There is a allelecounter.PY file here https://github.com/secastel/allelecounter but this doesn't work with the script I'm trying to use that relies on the .exe
Thanks in advance for your help
CGS (CRUK CI)

Ref files for hg38

Hello,

Do you currently offer reference files for hg38? If not, would you recommend just using Liftover?

Thank you!

chr names bug

allele_data[,1] = gsub("chr","",allele_data[,1])
normal_input_data[,1] = gsub("chr","",normal_input_data[,1])
input_data[,1] = gsub("chr","",input_data[,1])

Hi,
Note these 'gsub' lines in 2.2.10 are causing a bug in running hg38 with "chr" prefix contigs. These lines are removing "chr" and causing the subsequent line in gc.correct.wgs to fail to match up entries between Tumor_LogR and GC_data.

locimatches = match(x = paste0(Tumor_LogR$Chromosome, "",
Tumor_LogR$Position), table = paste0(GC_data$chr, "
",
GC_data$Position))

Build in ASCAT package - remove duplicate code

BB should not have the same code that is included into ASCAT. There are some issues when loading both packages at the moment as functions are overloaded.

Currently the recommended procedure is to load Battenberg before ASCAT if ASCAT is to be used and ASCAT before Battenberg when BB is to be used.

Difference between the sum of copy number segments and ntot values.

Hi,
I hope you can help me to understand the output from battenberg.
In the output of the call subclones part of the pipline a "segmentation file" is produced. This file has the clonal / subclonal copy number for each segment in the genome.
Here is a section that deomonstrates my confusion

chr startpos endpos BAF pval LogR ntot nMaj1_A nMin1_A frac1_A

21 5066290 5377969 0.534632035 0.203264531 0.073106415 2.109409229 1 1 1

21 5379291 13227129 0.790123457 0.090253745 0.064570678 2.096525116 2 1 1

21 13227353 46672988 0.5 1 0.004959692 2.008640488 1 1 1

Put most simply for regions with no called clonality (eg row 2) why is the sum of the major and minor alleles not equal to ntot. What does the ntot represent? How should there differences be reconciled?

many thanks

step producing file <tumour_sample>_battenberg_cn.vcf.gz

Hello Dave,

I was wondering if you could let me know which step of battenberg.pl produces these two files:
<tumour_sample>_battenberg_cn.vcf.gz
<tumour_sample>_battenberg_cn.vcf.gz.tbi

The reason for asking is that I cannot find them and yet the pipeline seems completed.

Many thanks
Jorge

Issues with running battenberg with hg38

Hello,

I have been trying to run Battenberg with hg38 and have been unable to do so with the master branch. I was looking through the comments and saw that a couple of months ago @pblaney recommended hg38_ChrNotation_fix_NAP but he still had issues.
Is there a stable version for hg38? What branch should I clone?

Move creation of segmented LogR to separate function

segmented LogR is now created within fitcopynumber. This means it is redone each time a copy number profile is (re)fitted. Moving this step to a separate function (or making it part of the segmentation output) would reduce runtime there. This could possibly be combined with moving the last two steps to only use the segmented data, removing the need for the raw data to be read in (ticket #5).

Read in data with readr package

The readr package is quite a bit quicker when reading in data. Part of this replace should be to put all the data parsers into a separate R file.

infile = "3e8a2c90-e747-4a22-bc9e-0b062479fec2_alleleFrequencies_chr4.txt"
system.time(read.table(infile, header=T, stringsAsFactor=F))
   user  system elapsed 
 18.045   0.364  18.562
system.time(read.delim(infile, header=T, stringsAsFactor=F, sep="\t"))
   user  system elapsed 
 22.165   0.744  22.872
library(readr)
system.time(read_tsv(infile, col_names=T))
   user  system elapsed 
 12.605   0.108  12.724

Error in gc.correct.wgs

Hello,

At some point in my Battengerg run I get the following error:

Error in [.data.frame(GC_data, , 2 + maxGCcol_short) :
undefined columns selected
Calls: gc.correct.wgs -> [ -> [.data.frame
In addition: Warning message:
In cor(GC_data[flag_nona, 3:ncol(GC_data)], Tumor_LogR[flag_nona, :
the standard deviation is zero

I was wondering if you could please help me to debug the error.

Many thanks
Jorge

only 0 for all count values

Hi,

For some reasons I am only getting 0s at tall the positions for all the bases in all the [_alleleFrequencies_chr.txt] files. My reads are mapped to hg19, aren't the coordinates in ref files for hg38 by any chance? Or I am not really sure what I might be doing wrong.

Best
Mo

correspondence between _A columns in the _subclones.txt and BattenbergProfile_subclones.png

Hello,

I was wondering if you could please help me to interpret the Battenberg output, which has given me two subclones. I understand the _A columns in the _subclones.txt match somehow segmetation plot in BattenbergProfile_subclones.png.

As I have two subclones, am I right saying that the columns nMaj1_A, nMin1_A contain the CN segments for one clone and the columns nMaj2_A, nMin2_A contain the CN for the other clone?

Then in the BattenbergProfile_subclones.png I see two different colours (as in ASCAT) and then thick and thin lines. Do the thick lines corrrespond to one clone and the thin ones to the other?

Thanks so much
Jorge

Callsubclones error

Hi there,

Getting an error in the callsubclones step:
Error in t.test.default(logR[logR$Chromosome == subclones$chr[i - 1] & :
not enough 'x' observations
Calls: callSubclones -> merge_segments -> t.test -> t.test.default

Not sure if it is related, but the fitcn step mentions that 'reference segment did not provide a possible solution':
[1] "ploidy=1.78246500278682,rho=0.81,goodness=84.6365342478127,percentzero=0.292220227244316, perczerAbb=0.292220227244316"
[1] "ploidy=3.51630145168029,rho=0.68,goodness=77.8766478542867,percentzero=0.289668502443199, perczerAbb=0.289668502443199"
[1] "goodnessOfFit from grid=0.52666849231385"
[1] "reference segment did not provide a possible solution"

I have this pipeline working for the majority of my samples, but would like to know if there is a way to resolve these sorts of issues. It results in the finalise step failing as it can't find the subclones.txt file. Are these samples for which there is simply no solution?

Any help would be greatly appreciated! Thanks!

number of 'X' in sunrise plot and their correspondence to subclones.txt file

Hello,

I have done different Battenberg runs where I get either one or two 'X' in the sunrise plot. I was wondering if you could help me to understand how the number of 'X' links to the solutions in the subclones.txt.

Specifically I have two questions, any advice on those will be much appreciated.

  1. Please correct me if I am wrong. My understanding is that a sunrise plot with a single 'X' means that there is a single tumour cell population. Then a sunrise plot with two 'X' means that there are two tumour cell populations corresponding to two states of any of the possible solutions in the subclones.txt file.

  2. I see however that even if there is a single 'X' in the sunrise plot the solutions in the subclones.txt include two states. Does this mean that one of the stages is negligible?

I have attached a png file showing two sunrise plots with one and two 'X'. In the plot with a single 'X' I am also showing the two states of solution A.
Screenshot 2020-04-10 at 10 55 45

Many thanks
Jorge

reference files for GRCh38

Hello,

I was wondering if there is a Battenberg reference area for GRCh38, or a procedure to generate the relevant files.

Many thanks
Jorge

Dockerfile fails at instillation of R packages

Hello,

In an effort to implement Battenberg into a WGS pipeline that is deployable across various systems, one of the first issues I encountered was the failing build recipe used in the current Dockerfile.
When following the directions in the README, this is the outcome from the build command:


$ docker build -t battenberg:2.2.9 .
[+] Building 334.6s (13/21)
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 3.58kB 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [internal] load metadata for docker.io/library/ubuntu:16.04 1.2s
=> [auth] library/ubuntu:pull token for registry-1.docker.io 0.0s
=> [internal] load build context 0.1s
=> => transferring context: 1.43MB 0.1s
=> [ 1/16] FROM docker.io/library/ubuntu:16.04@sha256:e74994b7a9ec8e2129cfc6a871f3236940006ed31091de355578492ed140 4.2s
=> => resolve docker.io/library/ubuntu:16.04@sha256:e74994b7a9ec8e2129cfc6a871f3236940006ed31091de355578492ed140a3 0.0s
=> => sha256:0ba7bf18aa406cb7dc372ac732de222b04d1c824ff1705d8900831c3d1361ff5 527B / 527B 0.2s
=> => sha256:e74994b7a9ec8e2129cfc6a871f3236940006ed31091de355578492ed140a39c 1.42kB / 1.42kB 0.0s
=> => sha256:edf232ee7dc18c57c063bc83533ef2c03c6dfae55a0124f7d372aed51cd1d9c8 1.15kB / 1.15kB 0.0s
=> => sha256:8185511cd5ad68f14aee2bac83a449a6eea2be06f0a4715b008cfe19f07a64f7 3.36kB / 3.36kB 0.0s
=> => sha256:4007a89234b4f56c03e6831dc220550d2e5fba935d9f5f5bcea64857ac4f4888 45.96MB / 45.96MB 1.2s
=> => sha256:5dfa26c6b9c9d1ccbcb1eaa65befa376805d9324174ac580ca76fdedc3575f54 852B / 852B 0.3s
=> => sha256:4c6ec688ebe374ea7d89ce967576d221a177ebd2c02ca9f053197f954102e30b 169B / 169B 0.3s
=> => extracting sha256:4007a89234b4f56c03e6831dc220550d2e5fba935d9f5f5bcea64857ac4f4888 2.2s
=> => extracting sha256:5dfa26c6b9c9d1ccbcb1eaa65befa376805d9324174ac580ca76fdedc3575f54 0.0s
=> => extracting sha256:0ba7bf18aa406cb7dc372ac732de222b04d1c824ff1705d8900831c3d1361ff5 0.0s
=> => extracting sha256:4c6ec688ebe374ea7d89ce967576d221a177ebd2c02ca9f053197f954102e30b 0.0s
=> [ 2/16] RUN apt-get update && apt-get install -y libxml2 libxml2-dev libcurl4-gnutls-dev r-cran-rgl git libssl 60.3s
=> [ 3/16] RUN mkdir /tmp/downloads 0.3s
=> [ 4/16] RUN curl -sSL -o tmp.tar.gz --retry 10 https://github.com/samtools/htslib/archive/1.7.tar.gz && mk 33.9s
=> [ 5/16] RUN curl -sSL -o tmp.tar.gz --retry 10 https://github.com/cancerit/alleleCount/archive/v4.0.0.tar.gz && 1.7s
=> [ 6/16] RUN curl -sSL -o tmp.tar.gz --retry 10 https://mathgen.stats.ox.ac.uk/impute/impute_v2.3.2_x86_64_stati 4.1s
=> [ 7/16] RUN R -q -e 'source("http://bioconductor.org/biocLite.R"); biocLite(c("gtools", "optparse", "devtools 228.1s
=> ERROR [ 8/16] RUN R -q -e 'devtools::install_github("Crick-CancerGenomics/ascat/ASCAT")' 0.6s

[ 8/16] RUN R -q -e 'devtools::install_github("Crick-CancerGenomics/ascat/ASCAT")':
#12 0.546 > devtools::install_github("Crick-CancerGenomics/ascat/ASCAT")
#12 0.547 Error in loadNamespace(name) : there is no package called 'devtools'
#12 0.547 Calls: :: ... tryCatch -> tryCatchList -> tryCatchOne ->
#12 0.547 Execution halted
executor failed running [/bin/sh -c R -q -e 'devtools::install_github("Crick-CancerGenomics/ascat/ASCAT")']: exit code: 1


After troubleshooting, the issue is the R packages to be installed via Bioconductor fail. I believe this is due to the base R version that comes with the Ubuntu 16.04 base image is too far out of date for some of the packages and their internal dependencies.
I have attached an edited version of the Dockerfile that has a successful build recipe (NOTE: the file was given a '.txt' extension for uploading purposes and to be used this extension must be removed). The changes being updating the OS base image to Ubuntu 20.04 and using Bioconductor's BiocManager installation method.
Also, I think it should be clearly stated that this build does not work with any hg38 aligned data or the newly provided hg38 reference data. This will be brought up in subsequent issues.

Dockerfile.txt

Docker Engine specs:


$ docker version
Client: Docker Engine - Community
Cloud integration: 1.0.9
Version: 20.10.5
API version: 1.41
Go version: go1.13.15
Git commit: 55c4c88
Built: Tue Mar 2 20:13:00 2021
OS/Arch: darwin/amd64
Context: default
Experimental: true

Server: Docker Engine - Community
Engine:
Version: 20.10.5
API version: 1.41 (minimum version 1.12)
Go version: go1.13.15
Git commit: 363e9a8
Built: Tue Mar 2 20:15:47 2021
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.4.3
GitCommit: 269548fa27e0089a8b8278fc4fc781d7f65a939b
runc:
Version: 1.0.0-rc92
GitCommit: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
docker-init:
Version: 0.19.0
GitCommit: de40ad0


Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.