pughlab / consensuscruncher Goto Github PK

ConsensusCruncher is a tool that suppresses errors in next-generation sequencing data by using unique molecular identifiers (UMIs) to amalgamate reads derived from the same DNA template into a consensus sequence.

Home Page: https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkz474/5498633

License: Other

Shell 0.97% Python 4.81% Jupyter Notebook 94.23%

consensus-sequences fastq-files error-suppression molecular-barcodes

consensuscruncher's Introduction

ConsensusCruncher

ConsensusCruncher is a tool that suppresses errors in next-generation sequencing data by using unique molecular identifers (UMIs) to amalgamate reads derived from the same DNA template into a consensus sequence.

To learn more about ConsensusCruncher and its applications: https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkz474/5498633

For a full documentation of ConsensusCruncher, please see our Read the Docs

Quick start

Dependencies

This pipeline requires the following dependencies:

Program	Version	Purpose
picrd	picard/1.9.1 or higher	Adding read group @RG in the alignment files
Python	3.5.1	Run ConsensusCruncher
BWA	0.7.15	Align reads
Samtools	1.3.1	Sorting and indexing bamfiles

All required python libraries can be installed by running pip install -r requirements.txt

Configuration

Set up config.ini with the appropriate configurations for [fastq2bam] and [consensus] modes. Alternatively, you can provide command-line arguments.

ConsensusCruncher.py processes one sample (2 paired-end FASTQ files or 1 BAM file) at a time. A sample script to generate shell scripts for multiple samples is available here.

Running ConsensusCruncher

Individual files

Run ConsensusCruncher.py [-c CONFIG] fastq2bam with required input parameters:

  --fastq1 FASTQ1       FASTQ containing Read 1 of paired-end reads. [MANDATORY]
  --fastq2 FASTQ2       FASTQ containing Read 2 of paired-end reads. [MANDATORY]
  -o OUTPUT, --output OUTPUT
                        Output directory, where barcode extracted FASTQ and
                        BAM files will be placed in subdirectories 'fastq_tag'
                        and 'bamfiles' respectively (dir will be created if
                        they do not exist). [MANDATORY]
  -n FILENAME, --name FILENAME
                        Output filename. If none provided, default will
                        extract output name by taking everything left of '_R'.
  -b BWA, --bwa BWA     Path to executable bwa. [MANDATORY]
  -r REF, --ref REF     Reference (BWA index). [MANDATORY]
  -s SAMTOOLS, --samtools SAMTOOLS
                        Path to executable samtools [MANDATORY]
  -p PATTERN, --bpattern PATTERN
                        Barcode pattern (N = random barcode bases, A|C|G|T =
                        fixed spacer bases). [Pattern or list must be provided]
  -l LIST, --blist LIST
                        List of barcodes (Text file with unique barcodes on
                        each line). [Pattern or list must be provided]

BARCODE DESIGN: You can input either a barcode list or barcode pattern or both. If both are provided, barcodes will first be matched with the list and then the constant spacer bases will be removed before the barcode is added to the header. N = random / barcode bases A | C | G | T = constant spacer bases e.g. ATNNGT means barcode is flanked by two spacers matching 'AT' in front and 'GT' behind.

DESCRIPTION: This script extracts molecular barcode tags and removes spacers from unzipped FASTQ files found in the input directory (file names must contain "R1" or "R2"). Barcode extracted FASTQ files are written to the 'fastq_tag' directory and are subsequently aligned with BWA mem. Bamfiles are written to the 'bamfile" directory under the project folder.

Run ConsensusCruncher.py [-c CONFIG] consensus with the required input parameters:

  -h, --help            show this help message and exit
  -i BAM, --input BAM   Input BAM file with barcodes extracted into header. [mandatory]
  -o OUTPUT, --output OUTPUT
                        Output directory, where a folder will be created for
                        the BAM file and consensus sequences. [mandatory]
  -s SAMTOOLS, --samtools SAMTOOLS
                        Path to executable samtools. [mandatory]
  --scorrect {True,False}
                        Singleton correction, default: True.
  -b BEDFILE, --bedfile Separator file to split bamfile into chunks for processing.
                        Default: hg19 cytoband (You can find other cytobands for your 
                        genome of interest on UCSC
                        http://hgdownload.cse.ucsc.edu/downloads.html).
                        For small BAM files, you may choose to turn off data splitting 
                        with '-b False' and process everything all at once (Division of 
                        data is only required for large data sets to offload the
                        memory burden).
  --cutoff CUTOFF       Consensus cut-off, default: 0.7 (70% of reads must
                        have the same base to form a consensus).
  --cleanup {True,False}
                        Remove intermediate files.

This script amalgamates duplicate reads in bamfiles into single-strand consensus sequences (SSCS), which are subsequently combined into duplex consensus sequences (DCS). Singletons (reads lacking duplicate sequences) are corrected, combined with SSCS to form SSCS + SC, and further collapsed to form DCS + SC. Finally, files containing all unique molecules (a.k.a. no duplicates) are created for SSCS and DCS.

Multiple files

script generator will create sh scripts for each file in a fastq directory.

The following parameters need to be changed in the config file: name, bwa, ref, samtools, bpattern (alternatively if a barcode list is used instead, remove bpattern and add blist as parameter). Please note: fastq1, fastq2, output, bam, and c_output can be ignored as those will be updated using the generate_scripts.sh file.
Update generate_scripts.sh with input, output, and code_dir.
Run generate_scripts.sh to create sh files and then run those scripts.

Overview

Example

In order to create consensus sequences, we first need to process fastq files into bam files. Sample fastq files can be found under the test folder. Please note these fastqs are only for testing purposes. For the full fastqs used in our paper, please download the data from the NCBI Sequence Read Archive (SRA; https://www.ncbi.nlm.nih.gov/sra/) under access numbers SRP140497 and SRP141184.

Fastqs to Bams

Given fastq as input files, fastq2bam mode removes the spacer region and extracts the barcode tag from each sequencing read into the header with extract_barcode.py.

REPO="[insert path to ConsensusCruncher repo]"
BWAPATH="[insert path to BWA]"
BWAINDEX="[insert path to BWA INDEX]"
BWAPATH="[insert path to SAMTOOLS]"

python ConsensusCruncher.py fastq2bam --fastq1 $REPO/test/fastq/LargeMid_56_L005_R1.fastq --FASTQ2 $REPO/test/fastq/LargeMid_56_L005_R2.fastq -o $REPO/test -b $BWAPATH -r $BWAIndex -s $SAMTOOLS -bpattern NNT

In the sample dataset, we utilized 2-bp (NN) barcodes and 1-bp (T) spacers. While the barcodes for each read can be one of 16 possible combinations (4^2), the spacer is an invariant "T" base used to ligate barcodes onto each end of a DNA fragment. Thus, a spacer filter is imposed to remove faulty reads. Barcodes from read 1 and read 2 are extracted and combined together before being added to the header.

READ FROM SEQUENCER
Read1:
@HWI-D00331:196:C900FANXX:5:1101:1332:2193 1:N:0:ACGTCACA   [<-- HEADER]
ATTAAGCCCCAGGCAGTTGCTAATGATGGGAGCTTAGTGCACAAGGGCTGGGCCTCCCTCTTGGAGCTGAACATTGTTTCTTGGGGACGGCTGTGCCCACCTCAGCGGGGAGGCAAGGATTAAATC  [<-- SEQUENCE]
+
BCCCCGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGEGGGGGBGGGGGGGGGGGGGGGGGGGGGGGEGG1:FGFGGGGGGGGG/CB>DG@GGGGGGG<DGGGGAAGGEGGB>DGGGEGGG/@G  [<-- QUALITY SCORE]

Read2:
@HWI-D00331:196:C900FANXX:5:1101:1332:2193 2:N:0:ACGTCACA
GGTGGGCTCCAGCCCTGATTTCCTCCCCCAGCCCTGCAGGGCTCAGGTCCAGAGGACACAAGTTTAACTTGCGGGTGGTCACTTGCCTCGTGCGGTGACGCCATGGTGCCCTCTCTGTGCAGCGCA
+
BBBBCGGGGEGGGGFGGGGGGGGGGGGGGGGGGGGGGB:FCGGGGGGGGGGEGGGGGGGG=FCGG:@GGGEGBGGGAGFGDE@FGGGGGFGFGEGDGGGFCGGDEBGGGGGGGEG=EGGGEEGGG#

------

AFTER BARCODE EXTRACTION AND SPACER ("T") REMOVAL
Read1:
@HWI-D00331:196:C900FANXX:5:1101:1332:2193|ATGG/1
AAGCCCCAGGCAGTTGCTAATGATGGGAGCTTAGTGCACAAGGGCTGGGCCTCCCTCTTGGAGCTGAACATTGTTTCTTGGGGACGGCTGTGCCCACCTCAGCGGGGAGGCAAGGATTAAATC
+
CCGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGEGGGGGBGGGGGGGGGGGGGGGGGGGGGGGEGG1:FGFGGGGGGGGG/CB>DG@GGGGGGG<DGGGGAAGGEGGB>DGGGEGGG/@G

Read2:
@HWI-D00331:196:C900FANXX:5:1101:1332:2193|ATGG/2
GGGCTCCAGCCCTGATTTCCTCCCCCAGCCCTGCAGGGCTCAGGTCCAGAGGACACAAGTTTAACTTGCGGGTGGTCACTTGCCTCGTGCGGTGACGCCATGGTGCCCTCTCTGTGCAGCGCA
+
BCGGGGEGGGGFGGGGGGGGGGGGGGGGGGGGGGB:FCGGGGGGGGGGEGGGGGGGG=FCGG:@GGGEGBGGGAGFGDE@FGGGGGFGFGEGDGGGFCGGDEBGGGGGGGEG=EGGGEEGGG#

FASTQ files with extracted barcodes are placed in the fastq_tag directory and are subsequently aligned with BWA to generate BAMs in the bamfiles folder.

. 
├── bamfiles 
├── fastq
├── fastq_tag
└── qsub

ConsensusCruncher

consensus mode creates a consensus directory and folders for each bam file.

BAM files undergo consensus construction through the workflow illustrated above. Output BAMs are grouped according to type of error suppression (SSCS vs DCS) and whether Singleton Correction (SC) was implemented.

. 
├── bamfiles 
├── consensus 
│   ├── LargeMid_56_L005 
│   │   ├── dcs 
│   │   ├── dcs_SC 
│   │   ├── sscs 
│   │   └── sscs_SC 
... 
│   ├── LargeMid_62_L006
│   │   ├── dcs
│   │   ├── dcs_SC
│   │   ├── sscs
│   │   └── sscs_SC
│   └── qsub
├── fastq
├── fastq_tag
└── qsub

Within a sample directory (e.g. LargeMid_56_L005), you will find the following files:

Please note the example below is for illustrative purposes only, as sample names and index files were removed for simplification. Order of directories and files were also altered to improve comprehension.

.                                           Filetype
├── sscs
│   ├── badReads.bam                        Reads that are unmapped or have multiple alignments
│   ├── sscs.sorted.bam                     Single-Strand Consensus Sequences (SSCS)
│   ├── singleton.sorted.bam                Single reads (Singleton) that cannot form SSCSs
├── sscs_SC
|   ├── singleton.rescue.sorted.bam         Singleton correction (SC) with complementary singletons
|   ├── sscs.rescue.sorted.bam              SC with complementary SSCSs
|   ├── sscs.sc.sorted.bam                  SSCS combined with corrected singletons (from both rescue strategies)   [*]
|   ├── rescue.remaining.sorted.bam         Singletons that could not be corrected
|   ├── all.unique.sscs.sorted.bam          SSCS + SC + remaining (uncorrected) singletons
├── dcs
│   ├── dcs.sorted.bam                      Duplex Consensus Sequence (DCS)
│   ├── sscs.singleton.sorted.bam           SSCSs that could not form DCSs as complementary strand was missing  
├── dcs_SC
│   ├── dcs.sc.sorted.bam                   DCS generated from SSCS + SC    [*]
│   ├── sscs.sc.singleton.sorted.bam        SSCS + SC that could not form DCSs 
│   ├── all.unique.dcs.sorted.bam           DCS (from SSCS + SC) + SSCS_SC_Singletons + remaining singletons
├── read_families.txt                       Family size and frequency
├── stats.txt                               Consensus sequence formation metrics
├── tag_fam_size.png                        Distribution of reads across family size
└── time_tracker.txt                        Time log

Through each stage of consensus formation, duplicate reads are collapsed together and single reads are written as separate files. This allows rentention of all unique molecules, while providing users with easy data management for cross-comparisons between error suppression strategies.

To simplify analyses, it would be good to focus on SSCS+SC ("sscs.sc.sorted.bam") and DCS+SC ("dcs.sc.sorted.bam") as highlighted above with [*].

How it works

Unique molecular identifiers (UMIs) composed of molecular barcodes and sequence features are used aggregate reads derived from the same strand of a template molecule. Amalgamation of such reads into single strand consensus sequences (SSCS) removes discordant bases, which effectively eliminates polymerase and sequencer errors. Complementary SSCSs can be subsequently combined to form a duplex consensus sequence (DCS), which eliminates asymmetric strand artefacts such as those that develop from oxidative damage.

Conventional UMI-based strategies rely on redundant sequencing from both template strands to form consensus sequences and cannot error suppress single reads (singleton). We enable singleton correction using complementary duplex reads in the absence of redundant sequencing.

ConsensusCruncher schematic:

An uncollapsed bamfile is first processed through SSCS_maker.py to create an error-suppressed single-strand consensus sequence (SSCS) bamfile and an uncorrected singleton bamfile.
The singletons can be corrected through singleton_correction.py, which error suppress singletons with its complementary SSCS or singleton read.
SSCS reads can be directly made into duplex consensus sequences (DCS) or merged with corrected singletons to create an expanded pool of DCS reads (Figure illustrates singleton correction merged work flow).

Issues and feature requests

Please use this repository templates available at .github/ISSUE_TEMPLATE

Who do I talk to?

Jeff Bruce ([email protected]), Trevor Pugh ([email protected]), Scott Bratman ([email protected])

consensuscruncher's People

Contributors

Stargazers

Watchers

Forkers

khzhu quevedor2 oicr-gsi silenwang schaudge mawusseagbessi biosoft kridel-lab wangdi2016 andy-captain

consensuscruncher's Issues

Prefer `os.subprocess.call()` to `os.system()`

Describe the bug

os.system() presents security vulnerabilities. The Python docs indicate that using the subprocess module is preferred.

To Reproduce

See 5 instances of os.system() in ConsensusCruncher.py

Additional considerations

Existing os.subprocess.call invocations in ConsensusCruncher.py create a string and then split it on spaces. This could be dangerous if any of the strings being passed in to the .format() call contain spaces. This could be mitigated by converting lines like this:

call("{} index {}/{}.sorted.bam".format(args.samtools, bam_dir, filename).split(' '))

To:

call([args.samtools, "index", "{}/{}.sorted.bam".format(bam_dir, filename)])

Add picard command in the ConsensusCruncher.py

Is your feature request related to a problem? Please describe.

Removing the readgroup arguments in the bwa aligner command and add Picard AddReplaceReadgroups after the bam file is created.

Describe the solution you'd like
java -jar picard.jar AddOrReplaceReadGroups RGID=4
RGLB=lib1
RGPL=ILLUMINA
RGPU=unit1
RGSM=20

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

barcode overlap checking needs to limit overlap from start

extract_barcode.py

for barcode in blist:
if sum([barcode in b for b in blist]) > 1:
overlap = True

"startswith" is better than "in": if sum([b.startswith(barcode) b for b in blist]) > 1
For example, ACG and TACG are not overlapped.

The resulting bam file has improper sample name

When using gatk GetSampleName on the resulting bam files, the sample names got are actually fastq file name instead of sample name. These will cause problem for downstram analysis when the incorrect sample name passed down to for example vcf header.

Import modules instead of shelling out to them

Is your feature request related to a problem? Please describe.
There are a few instances where commands are created by something like the following:

extractb_cmd = "{}/ConsensusCruncher/extract_barcodes.py --read1 {} --read2 {} --outfile {} --bpattern {} "
        "--blist {}".format(code_dir, args.fastq1, args.fastq2, outfile, args.bpattern, args.blist)
os.system(extractb_cmd)

Describe the solution you'd like
It would be clearer to import the module and use the functions within it, something like:

from extract_barcodes import main_function

main_function(read1=args.fastq1, read2=args.fast2, outfile=outfile, bpattern=args.bpattern, blist=args.blist)

Clarify link between genome and bedfile

Is your feature request related to a problem? Please describe.
If the user chooses genome == 'hg38', the bedfile is automatically set to the included hg38_cytoBand.txt file which will silently overwrite any args.bedfile value the user has chosen.

Describe the solution you'd like
Only overwrite args.bedfile if this value is not already specified by the user.

Describe alternatives you've considered
Update the documentation to make very clear that if hg38 is selected, it will use the inbuilt bedfile.

Overlapping barcodes are not clearly well-defined.

Is your feature request related to a problem? Please describe.
When you use a list of barcodes, you may end up with this error:

ValueError: There are overlapping barcodes in the list (difficult to determine which barcode is correct).

The error itself is not very clear. You need to find the corresponding python script and go inside the code to understand what it means exactly, in particular:

https://github.com/pughlab/ConsensusCruncher/blob/master/ConsensusCruncher/extract_barcodes.py

You will find the corresponding error here:

elif check_overlap(blist):raise ValueError("There are overlapping barcodes in the list (difficult to determine which barcode is correct).")

The function check_overlap(blist) gives a better understanding of the signification of the error:

def check_overlap(blist):
"""(list) -> bool
Return boolean indicating whether or not there's overlapping barcodes within the list.
check_overlap(['AACT', 'AGCT'])
False
check_overlap(['AACTCT', 'AACT'])
True
"""
overlap = False
for barcode in blist:
if sum([barcode in b for b in blist]) > 1:
overlap = True
return overlap

In other words, overlapping barcodes do not mean exact duplicates! It's a bit more subtile than this.
AACTCT and AACT will be considered as overlapping barcodes because AACT is a substring of AACTCT.

Describe the solution you'd like
Explicit better what overlapping barcodes mean and if there are any, remove them automatically within the code before running Consensus Cruncher.

Describe alternatives you've considered
To fix this error, you need to verify that all your barcodes are "unique" in the sense that they are not a substring of another barcode.
So I wrote my own script to remove all those "overlapping barcodes" before running Consensus Cruncher.

support variable spacer sequences

Is your feature request related to a problem? Please describe.
I'm analysing libraries that have 5bp UMIs followed by a random 2bp spacer that doesn't have to end in T, whereas I think ConsensusCruncher only supports UMI's with known/fixed spacer sequences that must end in T.

Describe the solution you'd like
I'd like to be able to provide bpattern=NNNNNSS, so that NNNNN is read as the UMI, and SS is skipped as the random spacer. The blist would be a file with all of the 5bp UMI sequences in it (as currently supported).

Describe alternatives you've considered
fgbio supports a read structure of 5M2S+T (5bp UMI matches, 2 bp spacer, remaining bp template) to do this job, but fgbio doesn't rescue singleton reads.

Error on LargeMid_56_L005_R{1,2}.fastq

Hi all,
I'm testing ConsensusCruncher on the given test data: ConsensusCruncher/test/LargeMid_56_L005_R{1,2}.fastq.
But some errors occurred when it came to the consensus step, and below is the output of consensus:

/data4/qiumin/soft/ConsensusCruncher/ConsensusCruncher/SSCS_maker.py --infile LargeMid/bamfiles/LargeMid_56_L005_R1.fastq.sorted.bam --outfile LargeMid/LargeMid_56_L005_R1.fastq.sorted/sscs/LargeMid_56_L005_R1.fastq.sorted.sscs.bam --cutoff 0.7 --bedfile /data4/qiumin/soft/ConsensusCruncher/ConsensusCruncher/hg19_cytoBand.txt
# === SSCS ===
Uncollapsed - Total reads: 19259
Uncollapsed - Unmapped reads: 15
Uncollapsed - Secondary/Supplementary reads: 23
SSCS reads: 0
Singletons: 19216
Bad spacers: 0

# QC: Total uncollapsed reads should be equivalent to mapped reads in bam file.
Total uncollapsed reads: 19259
Total mapped reads in bam file: 19564
QC: check dictionaries to see if there are any remaining reads
=== pair_dict remaining ===
HWI-D00331:196:C900FANXX:5:1101:13274:2385|CG.TG
read remaining:
HWI-D00331:196:C900FANXX:5:1101:13274:2385|CG.TG	161	7	114597515	60	123M	27	128565	123	CATATGCCATGCACATAAAATGTTATTTATATATTTATTGGTTAAATGAATTAACATTTAAATATTGGCATCGTAAGTGAATAAGTATTCAGTATCTTTGTAATCAATGGGTAACTCATGCTT	array('B', [15, 25, 34, 33, 37, 16, 37, 35, 37, 38, 38, 38, 38, 38, 38, 38, 34, 37, 38, 38, 38, 38, 38, 38, 37, 38, 38, 38, 38, 38, 38, 38, 34, 37, 38, 38, 38, 38, 38, 38, 38, 37, 33, 37, 38, 38, 37, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 34, 38, 38, 36, 36, 38, 38, 37, 35, 38, 38, 38, 36, 38, 38, 31, 35, 38, 38, 38, 38, 38, 38, 38, 36, 38, 36, 35, 35, 36, 38, 38, 33, 37, 38, 38, 38, 38, 38, 35, 38, 34, 36, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 37, 38, 38])	[('NM', 0), ('MD', '123'), ('AS', 123), ('XS', 20), ('RG', '1')]
mate:
HWI-D00331:196:C900FANXX:5:1101:13274:2385|CG.TG	81	27	128565	0	123M	7	114597515	123	TTGGGAGTTGGCCTTACTGGGTATCTGTAAGAACAGGGAAAAGGACACGCACCTGGCCTGTGGTGGTTACTTCTTTCTGAATCGTGTCAGAGAACTTGGCTGCTCTGGAAGAGCCAGTTTTGT	array('B', [35, 38, 38, 38, 38, 38, 38, 28, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 37, 38, 38, 38, 38, 36, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 36, 38, 38, 38, 38, 38, 37, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 36, 38, 35, 38, 38, 38, 38, 38, 37, 38, 38, 38, 38, 37, 38, 38, 38, 38, 38, 38, 38, 38, 38, 37, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 37, 38, 38, 33, 34, 34])	[('NM', 0), ('MD', '123'), ('AS', 123), ('XS', 123), ('RG', '1'), ('XA', 'chr16,-70923438,123M,0;')]
HWI-D00331:196:C900FANXX:5:1101:15070:3473|TC.GT
read remaining:
HWI-D00331:196:C900FANXX:5:1101:15070:3473|TC.GT	129	4	49985033	60	123M	27	359773	123	GAAGTCCAATTTATCTTCTTTTTTTAAAATCTGTGCCTCATCTGCAAATATTACCAATCGAAAGTCATGAAATTTTTCCCCTAAGATTTTATAGTTTTAGCGCTTACGTTTGGGTCTTTGATC	array('B', [33, 33, 38, 38, 38, 37, 38, 38, 38, 38, 38, 37, 34, 38, 36, 37, 38, 36, 38, 38, 38, 38, 38, 38, 38, 38, 38, 37, 36, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 36, 38, 38, 37, 36, 38, 38, 36, 38, 38, 37, 36, 37, 38, 38, 38, 38, 38, 38, 38, 38, 34, 38, 38, 38, 38, 38, 38, 36, 31, 31, 29, 33, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 36, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 33, 35, 38, 38, 38, 38, 38, 38, 38, 38, 38, 15])	[('NM', 0), ('MD', '123'), ('AS', 123), ('XS', 22), ('RG', '1')]
mate:
HWI-D00331:196:C900FANXX:5:1101:15070:3473|TC.GT	65	27	359773	2	123M	4	49985033	123	GACACCACACTTCATGCTCTGGGTGCCTGGTAACCTGAGTTTACCACTTGGAGGAGGTCACTACCTAAAATGTCGCAGTAAATGGTCTGTTGATAGAGCTTGGCTTCTAGTGGGTTAAAGTAC	array('B', [34, 34, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 34, 38, 38, 38, 38, 38, 38, 38, 37, 38, 38, 38, 38, 38, 38, 38, 37, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 36, 38, 38, 38, 38, 38, 36, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 36, 38, 37, 38, 37, 37, 38, 36, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 35, 38, 38, 38, 38, 38, 36, 34])	[('NM', 0), ('MD', '123'), ('AS', 123), ('XS', 121), ('RG', '1'), ('XA', 'chr16,+71149535,123M,1;')]
HWI-D00331:196:C900FANXX:5:1101:18338:2852|GT.CC
read remaining:
HWI-D00331:196:C900FANXX:5:1101:18338:2852|GT.CC	97	6	140573779	0	123M	60	145190	123	TGGACGCCAACGACAACTCGCCCTTCGTGCTGTACCCGCTGCAGAACGGCTCCGCGCCCTGCACCGAGCTGGTGCCCCGGGCGGCCGAGCCGGGCTACCTGGTGACCAAGGTGGTGGCGGTGG	array('B', [33, 33, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 37, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 26, 35, 38, 38, 36, 38, 38, 31, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 28, 38, 38, 38, 31, 35, 38, 13, 35, 35, 38, 38, 38, 38, 38, 38])	[('NM', 0), ('MD', '123'), ('AS', 123), ('XS', 123), ('RG', '1')]
mate:
HWI-D00331:196:C900FANXX:5:1101:18338:2852|GT.CC	145	60	145190	3	21S80M22S	6	140573779	80	TGGCCAAGCACAGGCTAGTGTTGGGTGATCAATGCAGAAATATGTCACAATGCCCCCTTAGGCAGAGCCTAGACAAAAGCCCCATCACCTGGATGATCAGTACAGGGTTATGTCAAAAAGTTA	array('B', [34, 38, 38, 35, 36, 38, 38, 38, 36, 37, 38, 38, 38, 38, 37, 38, 38, 37, 37, 38, 38, 38, 38, 38, 38, 35, 29, 38, 38, 38, 38, 36, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 31, 33, 37, 34, 38, 38, 38, 38, 38, 38, 35, 37, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 36, 38, 38, 38, 38, 38, 36, 38, 38, 38, 38, 38, 38, 38, 35, 38, 38, 38, 38, 37, 29, 35, 37, 34, 35, 36, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 37, 38, 38, 38, 38, 34, 38, 38, 38, 38, 38, 37, 38, 38, 37, 38, 38, 38, 38, 38, 32, 33])	[('NM', 1), ('MD', '10G69'), ('AS', 75), ('XS', 72), ('RG', '1'), ('XA', 'chrUn_gl000216,-154629,24S77M22S,1;chrUn_gl000225,+92868,2S89M32S,4;chrUn_gl000225,+117160,22S77M24S,2;chrUn_gl000225,-21045,23S92M8S,6;')]
HWI-D00331:196:C900FANXX:5:1101:2357:3051|TG.CA
read remaining:
HWI-D00331:196:C900FANXX:5:1101:2357:3051|TG.CA	113	20	40392124	0	123M	64	130694	123	CTGAGAGCCGAGCAGCCCAGGGAGCAGGTGTCCGCACAGAGCTCGTAGTGACTGTTCTGAGGGCATTCCATGGCTGCAAGGAGGGGGTGCCGATCAGAGCCCTGGGGAGGGAGGGGCTGCAAG	array('B', [38, 38, 36, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 34, 35, 38, 33, 37, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 34, 38, 34, 38, 33, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 37, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 34, 34])	[('NM', 0), ('MD', '123'), ('AS', 123), ('XS', 122), ('RG', '1'), ('XA', 'chr19,-40376440,123M,1;')]
mate:
HWI-D00331:196:C900FANXX:5:1101:2357:3051|TG.CA	177	64	130694	8	123M	20	40392124	123	ACACGTCACCCATAAGTGTGTGTTCCCGTGAGGAGAGATTTCTAAGAAATGGCACTGTACACTGAACGCAGTGGCTCACGTCTGTCATCCCGAGGTCAGGAGTTCGAGACCAGCCCGGCCAAC	array('B', [2, 34, 31, 38, 38, 38, 38, 38, 38, 38, 28, 38, 37, 37, 36, 33, 38, 38, 36, 38, 38, 38, 38, 27, 38, 38, 38, 38, 38, 38, 38, 38, 38, 36, 38, 37, 29, 38, 38, 38, 37, 38, 38, 38, 38, 38, 38, 38, 37, 38, 38, 37, 38, 38, 38, 36, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 34, 38, 38, 34, 37, 35, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 34, 37, 31, 37, 38, 34, 37, 38, 38, 31, 34, 37, 37, 29, 38, 38, 38, 36, 38, 38, 38, 38, 38, 38, 38, 38, 38, 33, 31])	[('NM', 2), ('MD', '35T16T70'), ('AS', 113), ('XS', 108), ('RG', '1'), ('XA', 'chrUn_gl000220,-126244,123M,3;')]
HWI-D00331:196:C900FANXX:5:1101:1040:2069|TC.AC
read remaining:
HWI-D00331:196:C900FANXX:5:1101:1040:2069|TC.AC	129	20	41630129	17	36S18M1D28M41S	64	129912	46	CTCCCCCCCCCCCCCCCCCCCCTTTCCCCCCTTTTTTTCTTTTTTTTTTTTTCTTTCCCCCCCCCCTTTTTTTTTTTTTTTTCTTTATTACGTAAGAAATATTGGAGTTGGATGAAATTTTTG	array('B', [33, 33, 16, 31, 29, 30, 36, 36, 38, 38, 33, 38, 29, 14, 27, 27, 34, 36, 38, 38, 29, 14, 14, 15, 15, 15, 15, 15, 15, 14, 14, 14, 15, 15, 15, 15, 14, 14, 15, 15, 15, 24, 15, 15, 14, 14, 14, 14, 14, 14, 14, 14, 15, 24, 15, 15, 15, 15, 15, 25, 25, 34, 29, 25, 34, 14, 14, 15, 15, 26, 31, 38, 38, 38, 33, 38, 31, 32, 26, 32, 38, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])	[('NM', 2), ('MD', '15C2^C28'), ('AS', 34), ('XS', 28), ('RG', '1'), ('XA', 'chr14,+55382840,28S28M67S,0;chr2,+61459994,54S28M41S,0;chr11,+76344684,58S28M37S,0;chr3,-59180857,41S28M54S,0;')]
mate:
HWI-D00331:196:C900FANXX:5:1101:1040:2069|TC.AC	65	64	129912	60	123M	20	41630129	123	TGAACACCCCCGTCACAAGTTTACCTATGTCACAGTCTTGCTCATGTATGCTTGAACGACNAATAAAAGTTCGGGGGGGNGAAGAGAGGAGAGAGAGAGAGAGACGGGGAGAGAGGGGGGAGG	array('B', [33, 33, 38, 38, 38, 38, 38, 38, 38, 37, 36, 38, 38, 38, 34, 38, 38, 38, 38, 38, 38, 38, 37, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 34, 38, 38, 38, 36, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 2, 28, 15, 27, 37, 38, 38, 38, 38, 38, 38, 38, 37, 38, 38, 38, 38, 38, 38, 2, 25, 14, 25, 34, 31, 36, 33, 35, 34, 24, 34, 36, 38, 33, 38, 38, 38, 36, 38, 27, 35, 38, 38, 38, 27, 38, 38, 31, 38, 23, 34, 34, 38, 13, 25, 36, 38, 38, 26, 35, 13, 23, 34])	[('NM', 3), ('MD', '60A18A42A0'), ('AS', 118), ('XS', 77), ('RG', '1')]
=== read_dict remaining ===
=== csn_pair_dict remaining ===
Traceback (most recent call last):
  File "/data4/qiumin/soft/ConsensusCruncher/ConsensusCruncher/SSCS_maker.py", line 391, in <module>
    main()
  File "/data4/qiumin/soft/ConsensusCruncher/ConsensusCruncher/SSCS_maker.py", line 372, in main
    plt.bar(list(tags_per_fam), read_fraction)
  File "/data4/qiumin/soft/ConsensusCruncher/dep/lib/python3.5/site-packages/matplotlib/pyplot.py", line 2639, in bar
    ax = gca()
  File "/data4/qiumin/soft/ConsensusCruncher/dep/lib/python3.5/site-packages/matplotlib/pyplot.py", line 935, in gca
    return gcf().gca(**kwargs)
  File "/data4/qiumin/soft/ConsensusCruncher/dep/lib/python3.5/site-packages/matplotlib/pyplot.py", line 585, in gcf
    return figure()
  File "/data4/qiumin/soft/ConsensusCruncher/dep/lib/python3.5/site-packages/matplotlib/pyplot.py", line 534, in figure
    **kwargs)
  File "/data4/qiumin/soft/ConsensusCruncher/dep/lib/python3.5/site-packages/matplotlib/backends/backend_qt4agg.py", line 46, in new_figure_manager
    return new_figure_manager_given_figure(num, thisFig)
  File "/data4/qiumin/soft/ConsensusCruncher/dep/lib/python3.5/site-packages/matplotlib/backends/backend_qt4agg.py", line 53, in new_figure_manager_given_figure
    canvas = FigureCanvasQTAgg(figure)
  File "/data4/qiumin/soft/ConsensusCruncher/dep/lib/python3.5/site-packages/matplotlib/backends/backend_qt4agg.py", line 76, in __init__
    FigureCanvasQT.__init__(self, figure)
  File "/data4/qiumin/soft/ConsensusCruncher/dep/lib/python3.5/site-packages/matplotlib/backends/backend_qt4.py", line 63, in __init__
    _create_qApp()
  File "/data4/qiumin/soft/ConsensusCruncher/dep/lib/python3.5/site-packages/matplotlib/backends/backend_qt5.py", line 136, in _create_qApp
    raise RuntimeError('Invalid DISPLAY variable')
RuntimeError: Invalid DISPLAY variable
/data4/qiumin/soft/ConsensusCruncher/ConsensusCruncher/DCS_maker.py --infile LargeMid/LargeMid_56_L005_R1.fastq.sorted/sscs/LargeMid_56_L005_R1.fastq.sorted.sscs.sorted.bam --outfile LargeMid/LargeMid_56_L005_R1.fastq.sorted/dcs/LargeMid_56_L005_R1.fastq.sorted.dcs.bam --bedfile /data4/qiumin/soft/ConsensusCruncher/ConsensusCruncher/hg19_cytoBand.txt
# === DCS ===
SSCS - Total reads: 0
SSCS - Unmapped reads: 0
SSCS - Secondary/Supplementary reads: 0
DCS reads: 0
SSCS singletons: 0 

/data4/qiumin/soft/ConsensusCruncher/ConsensusCruncher/singleton_correction.py --singleton LargeMid/LargeMid_56_L005_R1.fastq.sorted/sscs/LargeMid_56_L005_R1.fastq.sorted.singleton.sorted.bam --bedfile /data4/qiumin/soft/ConsensusCruncher/ConsensusCruncher/hg19_cytoBand.txt
# === Singleton Correction ===
Total singletons: 19216
Singleton Correction by SSCS: 0
% Singleton Correction by SSCS: 0.0
Singleton Correction by Singletons: 4
% Singleton Correction by Singletons : 0.020815986677768527
Uncorrected Singletons: 19212 

0.009413317839304606
/data4/qiumin/soft/ConsensusCruncher/dep/bin/samtools merge LargeMid/LargeMid_56_L005_R1.fastq.sorted/sscs_sc/LargeMid_56_L005_R1.fastq.sorted.sscs.sc.bam LargeMid/LargeMid_56_L005_R1.fastq.sorted/sscs/LargeMid_56_L005_R1.fastq.sorted.sscs.sorted.bam LargeMid/LargeMid_56_L005_R1.fastq.sorted/sscs_sc/LargeMid_56_L005_R1.fastq.sorted.sscs.correction.sorted.bam LargeMid/LargeMid_56_L005_R1.fastq.sorted/sscs_sc/LargeMid_56_L005_R1.fastq.sorted.singleton.correction.sorted.bam
/data4/qiumin/soft/ConsensusCruncher/ConsensusCruncher/DCS_maker.py --infile LargeMid/LargeMid_56_L005_R1.fastq.sorted/sscs_sc/LargeMid_56_L005_R1.fastq.sorted.sscs.sc.sorted.bam --outfile LargeMid/LargeMid_56_L005_R1.fastq.sorted/dcs_sc/LargeMid_56_L005_R1.fastq.sorted.dcs.sc.bam --bedfile /data4/qiumin/soft/ConsensusCruncher/ConsensusCruncher/hg19_cytoBand.txt
# === DCS - Singleton Correction ===
SSCS SC - Total reads: 4
SSCS SC - Unmapped reads: 0
SSCS SC - Secondary/Supplementary reads: 0
DCS SC reads: 2
SSCS SC singletons: 0 

LargeMid/LargeMid_56_L005_R1.fastq.sorted/dcs_sc/LargeMid_56_L005_R1.fastq.sorted.all.unique.dcs.bam
Traceback (most recent call last):
  File "/data4/qiumin/soft/ConsensusCruncher/ConsensusCruncher.py", line 473, in <module>
    args.func(args)
  File "/data4/qiumin/soft/ConsensusCruncher/ConsensusCruncher.py", line 293, in consensus
    '{}/{}_tag_fam_size.png'.format(sample_dir, identifier))
FileNotFoundError: [Errno 2] No such file or directory: 'LargeMid/LargeMid_56_L005_R1.fastq.sorted/sscs/LargeMid_56_L005_R1.fastq.sorted_tag_fam_size.png' -> 'LargeMid/LargeMid_56_L005_R1.fastq.sorted/LargeMid_56_L005_R1.fastq.sorted_tag_fam_size.png'`

There seem to be 2 problems:

the program trying to use qt to display something and it failed because I don't have one on my node.
Total uncollapsed reads should be equivalent to mapped reads in bam file, but it does not match

So, are these errors acceptable? And I found that in these 2 files, only 2% of the total reads were corrected by singletons method, but according to your paper, the corrected reads shold be more than 30%. Am I making something wrong?

Thanks,
Silen

Out of memory error

ConsensusCruncher silently runs out of memory mid-bam file and makes SSCS and DCS bam files that only contain reads mapping to the alphanumerically-first chromosomes. Could a solution be to only read individual chromosomes from the bam for ConsensusCrunch'ing and then clear the memory before moving on to the next chromosome?

Allow this library to be installed via pip

Describe the solution you'd like
We'd like to be able to install this library via pip, and have it run ConsensusCruncher.py from the command line.

it would be great if the setup.py file could make use of entry points to allow the helper modules to be run independently from the command line.

Thanks!

Updating instillation requirements

Hey Nina,

I just started using your program, and was wondering if you could update the requirements.txt to explicitly state matplotlib as a dependency? This may help future users.

Thanks :)

IndexError: string index out of range

I was running ConsensusCeuncher to collapse UMIs. It seems to output some results:

├── sscs
│   ├── sscs.sorted.bam (.bai)                     
│   ├── singleton.sorted.bam (.bai)                
├── sscs_SC
|   ├── sscs.sc.sorted.bam (.bai)              
├── dcs
│   ├── dcs.sorted.bam (.bai)
├── dcs_SC
│   ├── dcs.sc.sorted.bam(.bai)
│   ├── all.unique.dcs.sorted.bam(.bai)
├── read_families.txt                       Family size and frequency
├── stats.txt                               Consensus sequence formation metrics
├── tag_fam_size.png                        Distribution of reads across family size

However, when I checked on the log, I found an error:

# === DCS ===
SSCS - Total reads: 26020276
SSCS - Unmapped reads: 0
SSCS - Secondary/Supplementary reads: 0
DCS reads: 89306
SSCS singletons: 25841664 

[bam_sort_core] merging from 6 files and 1 in-memory blocks...
Traceback (most recent call last):
  File "/ConsensusCruncher/singleton_correction.py", line 320, in <module>
    main()
  File "/ConsensusCruncher/singleton_correction.py", line 268, in main
    corrected_read = strand_correction(tag, duplex, query_name, singleton_dict)
  File "/ConsensusCruncher/singleton_correction.py", line 101, in strand_correction
    dcs = duplex_consensus(read, complement_read)
  File "/ConsensusCruncher/singleton_correction.py", line 71, in duplex_consensus
    if read1.query_sequence[i] == read2.query_sequence[i] and \
IndexError: string index out of range
[bam_sort_core] merging from 6 files and 1 in-memory blocks...
# === DCS - Singleton Correction ===
SSCS SC - Total reads: 26023245
SSCS SC - Unmapped reads: 0

It looks like DCS was not properly performed? For my experiments, I might just need to use sscs.sc.sorted.bam. Are these final bam files still safe to use? The following is my commands. Thanks!
python3 ConsensusCruncher.py fastq2bam --fastq1 sample_R1.fastq --fastq2 sample_R2.fastq -o out_dir -b bwa -g /picard/2.10.9/picard.jar -r hg38.fasta -s samtools -l umilist.txt

python3 ConsensusCruncher.py consensus -i sample.sorted.bam -o out_dir -s samtools -b cytoBand.txt -g hg38 --cleanup True --scorrect True

name 'overlap' is not defined in extract_barcodes.py

Describe the bug

name 'overlap' in line 79 is not defined in extract_barcodes.py and a DeprecationWarning for 'U' mode in line 156 and 157

To Reproduce

Steps to reproduce the behavior:

Go to 'ConsensusCruncher/extract_barcodes.py'
Use commands 'ConsensusCruncher/extract_barcodes.py --read1 --read2 --outfile --blist'
Use inputs available at '...'
See error

Expected behavior

Extract UMI barcode

Screenshots

If applicable, add screenshots to help explain your problem.

System (please complete the following information):

OS: Linux
Software versions: Feb 14, 2020

Additional context

Add any other context about the problem here.

Relevant inputs

Please make sure to provide relevant input files

Bam header RG conflict

Describe the bug

The header in the bam files that are created by bwa aligner have two instances of @rg written in the bam header and GATK tools have issues with that. Example:
@rg ID:1 SM:SE17-1161_N_cfDNA-010-R1.fastq.gz PL:Illumina
@pg ID:bwa PN:bwa VN:0.7.15-r1140 CL:/mnt/work1/software/bwa/0.7.15/bwa mem -M -t4 -R @rg ID:1 SM:SE17-1161_N_cfDNA-010-R1.fastq.gz

To Reproduce

Steps to reproduce the behavior:

Go to '...'
Use commands '....'
Use inputs available at '...'
See error

Expected behavior

A clear and concise description of what you expected to happen.

Screenshots

If applicable, add screenshots to help explain your problem.

System (please complete the following information):

OS: [e.g. iOS, Linux, etc]
Software versions

Additional context

Add any other context about the problem here.

Relevant inputs

Please make sure to provide relevant input files

The results output is confusing.

Is your feature request related to a problem? Please describe.
In the results output, you get a "bamfiles" and a "consensus" folder.
If you stop at the description of the input parameters from the GitHub documentation page, you would think that the "bamfiles" folder is the output of the program itself.

-o OUTPUT, --output OUTPUT
Output directory, where barcode extracted FASTQ and BAM files will be placed in subdirectories 'fastq_tag' and 'bamfiles' respectively (dir will be created if they do not exist). [MANDATORY]

However, be aware that the "bamfiles" folder contain the bam files from the alignments by bwa sorted and indexed with samtools and NOT the one from consensus cruncher. The bam files resulting from consensus cruncher are in the "consensus" folder (there are more of them, up to 5). This is better explained after in the documentation so it's important to read the full documentation before using the program even though I think this part should be edited (in general users tend to go fast and stop reading at the description of the parameters).

Describe the solution you'd like

The output folders need to be definitively renamed to avoid any confusion for the users:
"bamfiles" to "bamfiles from bwa-samtools"
"consensus" to "bamfiles from consensus-cruncher"
And the GitHub documentation page should be edited too regarding the description of the parameters as this part will lead almost anyone to confusion.

Error in consensus mode

Hi,

We ran ConsensusCruncher.py fastq2bam on the LargeMid_56 fastq files provided in the test directory. The barcode-extracted fastq files and subsequently the BWA aligned BAMfile were generated without any errors.

ConsensusCruncher.py consensus was run with the following command:
python3 ~/testing/duplex/ConsensusCruncher/ConsensusCruncher.py consensus -i ~/testing/duplex/results/largemid/output/bamfiles/LargeMid_56_L005_R1.fastq.sorted.bam -o ~/testing/duplex/results/largemid/output/consensus_output/ -s ~/programs/samtools-1.3.1/samtools --cleanup True --cutoff 0.7 -b ~/testing/duplex/ConsensusCruncher/ConsensusCruncher/hg19_cytoBand.txt

This gives the following error:

# === SSCS ===
Uncollapsed - Total reads: 19579
Uncollapsed - Unmapped reads: 0
Uncollapsed - Secondary/Supplementary reads: 0
SSCS reads: 0
Singletons: 0
Bad spacers: 19579

# QC: Total uncollapsed reads should be equivalent to mapped reads in bam file.
Total uncollapsed reads: 19579
Total mapped reads in bam file: 19562
QC: check dictionaries to see if there are any remaining reads
=== pair_dict remaining ===
=== read_dict remaining ===
=== csn_pair_dict remaining ===
Traceback (most recent call last):
  File "/scratch/hematopath/testing/duplex/ConsensusCruncher/ConsensusCruncher/SSCS_maker.py", line 391, in <module>
    main()
  File "/scratch/hematopath/testing/duplex/ConsensusCruncher/ConsensusCruncher/SSCS_maker.py", line 374, in main
    plt.xlim([0, math.ceil(lst_tags_per_fam[-1][0]/10) * 10])
IndexError: list index out of range
# === DCS ===
SSCS - Total reads: 0
SSCS - Unmapped reads: 0
SSCS - Secondary/Supplementary reads: 0
DCS reads: 0
SSCS singletons: 0 

Traceback (most recent call last):
  File "/scratch/hematopath/testing/duplex/ConsensusCruncher/ConsensusCruncher/singleton_correction.py", line 320, in <module>
    main()
  File "/scratch/hematopath/testing/duplex/ConsensusCruncher/ConsensusCruncher/singleton_correction.py", line 291, in main
    sscs_correction_frac = (sscs_dup_correction/singleton_counter) * 100
ZeroDivisionError: division by zero
# === DCS - Singleton Correction ===
SSCS SC - Total reads: 0
SSCS SC - Unmapped reads: 0
SSCS SC - Secondary/Supplementary reads: 0
DCS SC reads: 0
SSCS SC singletons: 0 

/scratch/hematopath/testing/duplex/ConsensusCruncher/ConsensusCruncher/SSCS_maker.py --infile /scratch/hematopath/testing/duplex/results/largemid/output/bamfiles/LargeMid_56_L005_R1.fastq.sorted.bam --outfile /scratch/hematopath/testing/duplex/results/largemid/output/consensus_output//LargeMid_56_L005_R1.fastq.sorted/sscs/LargeMid_56_L005_R1.fastq.sorted.sscs.bam --cutoff 0.7 --bedfile /scratch/hematopath/testing/duplex/ConsensusCruncher/ConsensusCruncher/hg19_cytoBand.txt --bdelim None
/scratch/hematopath/testing/duplex/ConsensusCruncher/ConsensusCruncher/DCS_maker.py --infile /scratch/hematopath/testing/duplex/results/largemid/output/consensus_output//LargeMid_56_L005_R1.fastq.sorted/sscs/LargeMid_56_L005_R1.fastq.sorted.sscs.sorted.bam --outfile /scratch/hematopath/testing/duplex/results/largemid/output/consensus_output//LargeMid_56_L005_R1.fastq.sorted/dcs/LargeMid_56_L005_R1.fastq.sorted.dcs.bam --bedfile /scratch/hematopath/testing/duplex/ConsensusCruncher/ConsensusCruncher/hg19_cytoBand.txt
/scratch/hematopath/testing/duplex/ConsensusCruncher/ConsensusCruncher/singleton_correction.py --singleton /scratch/hematopath/testing/duplex/results/largemid/output/consensus_output//LargeMid_56_L005_R1.fastq.sorted/sscs/LargeMid_56_L005_R1.fastq.sorted.singleton.sorted.bam --bedfile /scratch/hematopath/testing/duplex/ConsensusCruncher/ConsensusCruncher/hg19_cytoBand.txt
/scratch/hematopath/programs/samtools-1.7/samtools merge /scratch/hematopath/testing/duplex/results/largemid/output/consensus_output//LargeMid_56_L005_R1.fastq.sorted/sscs_sc/LargeMid_56_L005_R1.fastq.sorted.sscs.sc.bam /scratch/hematopath/testing/duplex/results/largemid/output/consensus_output//LargeMid_56_L005_R1.fastq.sorted/sscs/LargeMid_56_L005_R1.fastq.sorted.sscs.sorted.bam /scratch/hematopath/testing/duplex/results/largemid/output/consensus_output//LargeMid_56_L005_R1.fastq.sorted/sscs_sc/LargeMid_56_L005_R1.fastq.sorted.sscs.correction.sorted.bam /scratch/hematopath/testing/duplex/results/largemid/output/consensus_output//LargeMid_56_L005_R1.fastq.sorted/sscs_sc/LargeMid_56_L005_R1.fastq.sorted.singleton.correction.sorted.bam
/scratch/hematopath/testing/duplex/ConsensusCruncher/ConsensusCruncher/DCS_maker.py --infile /scratch/hematopath/testing/duplex/results/largemid/output/consensus_output//LargeMid_56_L005_R1.fastq.sorted/sscs_sc/LargeMid_56_L005_R1.fastq.sorted.sscs.sc.sorted.bam --outfile /scratch/hematopath/testing/duplex/results/largemid/output/consensus_output//LargeMid_56_L005_R1.fastq.sorted/dcs_sc/LargeMid_56_L005_R1.fastq.sorted.dcs.sc.bam --bedfile /scratch/hematopath/testing/duplex/ConsensusCruncher/ConsensusCruncher/hg19_cytoBand.txt
/scratch/hematopath/testing/duplex/results/largemid/output/consensus_output//LargeMid_56_L005_R1.fastq.sorted/dcs_sc/LargeMid_56_L005_R1.fastq.sorted.all.unique.dcs.bam
Traceback (most recent call last):
  File "/scratch/hematopath/testing/duplex/ConsensusCruncher/ConsensusCruncher.py", line 484, in <module>
    args.func(args)
  File "/scratch/hematopath/testing/duplex/ConsensusCruncher/ConsensusCruncher.py", line 301, in consensus
    '{}/{}_tag_fam_size.png'.format(sample_dir, identifier))
FileNotFoundError: [Errno 2] No such file or directory: '/scratch/hematopath/testing/duplex/results/largemid/output/consensus_output//LargeMid_56_L005_R1.fastq.sorted/sscs/LargeMid_56_L005_R1.fastq.sorted_tag_fam_size.png' -> '/scratch/hematopath/testing/duplex/results/largemid/output/consensus_output//LargeMid_56_L005_R1.fastq.sorted/LargeMid_56_L005_R1.fastq.sorted_tag_fam_size.png'

System
OS: RHEL(7.3)
Other software versions:
Python 3.7.4
BWA 0.7.15
Samtools 1.3.1

We are unable to figure out what could be causing these errors and would be extremely grateful if you can help us with this.
Thanks

pughlab / consensuscruncher Goto Github PK

consensuscruncher's Introduction

ConsensusCruncher

Quick start

Dependencies

Configuration

Running ConsensusCruncher

Individual files

Multiple files

Overview

Example

Fastqs to Bams

ConsensusCruncher

How it works

Issues and feature requests

Who do I talk to?

consensuscruncher's People

Contributors

Stargazers

Watchers

Forkers

consensuscruncher's Issues

Recommend Projects

Recommend Topics

Recommend Org