opengene / gencore Goto Github PK

Generate duplex/single consensus reads to reduce sequencing noises and remove duplications

License: MIT License

Makefile 0.38% C++ 99.40% C 0.22%

sequencing ngs bioinformatics sequencing-error consensus deep-sequencing sequencing-noise duplication duplex deduplication

gencore's Introduction

gencore

An efficient tool to remove sequencing duplications and eliminate sequencing errors by generating consensus reads.

What's gencore
Download, compile and install
Why to use gencore
Understand the output
How it works
Command examples
UMI format
All options
Read/cite gencore paper

what's gencore?

gencore is a tool for fast and powerful deduplication for paired-end next-generation sequencing (NGS) data. It is much faster and uses much less memory than Picard and other tools. It generates very informative reports in both HTML and JSON formats. It's based on an algorithm for generating consensus reads, and that's why it's named gencore.

Basically, gencore groups the reads derived from the same original DNA template, merges them by generating a consensus read, which contains much less errors than the original reads.

gencore supports the data with unique molecular identifiers (UMI). If your FASTQ data has UMI integrated, you can use fastp to shift the UMI to read query names, and use gencore to generate consensus reads.

This tool can eliminate the errors introduced by library preparation and sequencing processes, and consenquently reduce the false positives for downstream variant calling. This tool can also be used to remove duplicated reads. Since it generates consensus reads from duplicated reads, it outputs much cleaner data than conventional duplication remover. Due to these advantages, it is especially useful for processing ultra-deep sequencing data for cancer samples.

gencore accepts a sorted BAM/SAM with its corresponding reference fasta as input, and outputs an unsorted BAM/SAM.

take a quick glance of the informative report

Sample HTML report: http://opengene.org/gencore/gencore.html
Sample JSON report: http://opengene.org/gencore/gencore.json

try gencore to generate above reports

BAM file for testing: http://opengene.org/gencore/input.sorted.bam
BED file for testing: http://opengene.org/gencore/test.bed
Reference genome file: ftp://ftp.ncbi.nlm.nih.gov/sra/reports/Assembly/GRCh37-HG19_Broad_variant/Homo_sapiens_assembly19.fasta
Command for testing:

gencore -i input.sorted.bam -o output.bam -r Homo_sapiens_assembly19.fasta -b test.bed --coverage_sampling=50000

After the processing is finished, check the gencore.html and gencore.json in the working directory. The option --coverage_sampling=50000 is to change the default setting (coverage_sampling=10000) to generate smaller report files by reducing the coverage sampling rate.

quick examples

The simplest way

gencore -i input.sorted.bam -o output.bam -r hg19.fasta

With a BED file to specify the capturing regions

gencore -i input.sorted.bam -o output.bam -r hg19.fasta -b test.bed

Only output the fragment with >=2 supporting reads (useful for aggressive denoising)

gencore -i input.sorted.bam -o output.bam -r hg19.fasta -b test.bed -s 2

get gencore

install with Bioconda

conda install -c bioconda gencore

download binary

This binary is only for Linux systems: http://opengene.org/gencore/gencore

# this binary was compiled on CentOS, and tested on CentOS/Ubuntu
wget http://opengene.org/gencore/gencore
chmod a+x ./gencore

or compile from source

# step 1: download and compile htslib from: https://github.com/samtools/htslib
# step 2: get gencore source (you can also use browser to download from master or releases)
git clone https://github.com/OpenGene/gencore.git

# step 3: build
cd gencore
make

# step 4: install
sudo make install

why to use gencore?

As described above, gencore can eliminate the errors introduced by library preparation and sequencing processes, and consenquently it can greatly reduce the false positives for downstream variant calling. Let me show your an example.

original BAM

This is an image showing a pileup of the original BAM. A lot of sequencing errors can be observed.

gencore processed BAM

This is the image showing the result of gencore processed BAM. It becomes much cleaner. Cheers!

QC result reported by gencore

gencore also performs some quality control when processing deduplication and generating consensus reads. Basically it reports mapping rate, duplication rate, mismatch rate and some statisticical results. Especially, gencore reports the coverate statistics of input BAM file in genome scale, and in capturing regions (if a BED file is specified).

gencore reports the results both in HTML format and JSON format for manually checking and downstream analysis. See the examples of interactive HTML report and JSON reports.

coverate statistics in genome scale

coverate statistics in capturing regions

understand the output

gencore outputs following files:

the processed BAM. In this BAM, each consensus read will have a tag FR, which means forward read number of this consensus read. If the read is a duplex consensus read, it will also has a tag RR, which means reverse read number of this consensus read. Downstream tools can read the FR and RR tags for further processing or variant calling. In following example, the first read is a single-stranded consensus sequence (only has a FR tag), and the second read is a duplex consensus sequence (has both FR and RR tags):

A00250:28:H2HC3DSX2:1:1117:3242:5321:UMI_GCT_CTA        161     chr12   25377992        60      143M    =       25378431        582
     GCAATAATTTTTGTCAGAAAAATGCATTAAATGAATAACAGAATTTCTGTTGGCTTTCTGGGTATTGTCTTTCTTTAATGAGACCTTTCTCCAGAAATAAACACATCCTCAAAAAAATTCTGCCAAAGTAAAATTCTTCAAAT FFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:1  MD:Z:34G108     AS:i:138        XS:i:21 RG:Z:cfdna      FR:i:2
A00250:28:H2HC3DSX2:1:2316:10547:25989:UMI_AAC_AGA      161     chr12   25377993        60      143M    =       25378462        612
     CAATAATTTTTGTCAGAAAAATGCATTAAATGAATAACAGAATTTCTGTTGGCTTTCTGGGTATTGTCTTTCTTTAATGAGACCTTTCTCCAGAAATAAACACATCCTCAAAAAAATTCTGCCAAAGTAAAATTCTTCAAATA FFFFF:FFFFFFFFFFFFFFFFFFFFF:FF:FFFFFFFFFF,FFFFFFFFFFFF,:FFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF:FFF,!FF:F:F:F,FFF,F:FFFF,,:F,FFFF:FF:,:FF:F,:, NM:i:1  MD:Z:33G67A41   AS:i:133        XS:i:21 RG:Z:cfdna      FR:i:1  RR:i:5

the JSON report. A json file contains lots of statistical informations.
the HTML report. A html file visualizes the information of the JSON.
the plain text output.

how it works

important steps:

clusters the reads by their mapping positions and UMIs (if UMIs are applicable).
for each cluster, compares its supporting reads number (the number of reads/pairs for this DNA fragment) with the threshold specified by supporting_reads. If it passes, start to generate a consensus read for it.
if the reads are paired, finds the overlapped region of each pair, and scores the bases in the overlapped regions according their concordance and base quality.
for each base position at this cluster, computes the total scores of each different nucleotide (A/T/C/G/N).
if there exists a major nucleotide with good quality, use this nucleotide for this position; otherwise, check the reference nucleotide from reference genome (if reference is specified).
when checking the reference, if there exists one or more reads are concordant with reference genome with high quality, or all reads at this positions are with low quality, use the reference nucleotide for this position.

the quality thresholds

gencore uses 3 different thresholds, and they can be specified by the commandline options：

Quality threshold	Default Score	CMD option
High Quality	30 (Q30)	--high_qual
Moderate Quality	20 (Q20)	--moderate_qual
Low Quality	15 (Q15)	--low_qual

the scoring

gencore assigns a score to each base in a read of a read cluster, the score means the confidence of this base. The score is given by following rules:

in overlapped region?	matched with its pair?	condition?	score for this base
NO	N/A	HIGH_QUAL <= this_qual	8
NO	N/A	MODERATE_QUAL <= this_qual < HIGH_QUAL	6
NO	N/A	LOW_QUAL <= this_qual < MODERATE_QUAL	4
NO	N/A	this_qual < LOW_QUAL	2
YES	YES	2 * HIGH_QUAL <= this_qual + pair_qual	12
YES	YES	2 * MODERATE_QUAL <= this_qual + pair_qual < 2 * HIGH_QUAL	10
YES	YES	2 * LOW_QUAL <= this_qual + pair_qual < 2 * MODERATE_QUAL	8
YES	YES	this_qual + pair_qual < 2 * LOW_QUAL	6
YES	NO	HIGH_QUAL <= this_qual - pair_qual	5
YES	NO	MODERATE_QUAL <= this_qual - pair_qual < HIGH_QUAL	3
YES	NO	LOW_QUAL <= this_qual - pair_qual < MODERATE_QUAL	1
YES	NO	this_qual - pair_qual < LOW_QUAL	0

In this table:

this_qual is the quality of this base
pair_qual is the quality of the corresponding in the overlapped region of a pair.
HIGH_QUAL is the quality threshold that can be specified by --high_qual
MODERATE_QUAL is the quality threshold that can be specified by --moderate_qual
LOW_QUAL is the quality threshold that can be specified by --low_qual

In the overlapped region, if a base and its pair are mismatched, its quality score will be adjusted to: max(0, this_qual - pair_qual)

command examples

If you want to get very clean data, we can only keep the clusters with 2 or more supporting reads (recommended for ultra-deep sequencing with higher dup-rate):

gencore -i in.bam -o out.bam -r hg19.fa -s 2

If you want to keep all the DNA fragments, we can set supporting_reads to 1 (this option can be used to replace picard markduplicate to deduplication):

gencore -i in.bam -o out.bam -r hg19.fa -s 1

(Recommanded) If you want to keep all the DNA fragments, and for each output read you want to discard all the low quality unoverlapped mutations to obtain a relative clean data (recommended for dup-rate < 50%):

gencore -i in.bam -o out.bam -r hg19.fa -s 1 --score_threshold=9

If you want to obtain fewer but ultra clean data, and your data has UMI, you can enable the duplex_only option, and increase the supporting_reads and the ratio_threshold:

gencore -i in.bam -o out.bam -r hg19.fa --duplex_only -s 3 --ratio_threshold=0.9

Please note that only UMI-integrated paired-end data can be used to generate duplex consensuses sequences.

UMI format

gencore supports calling consensus reads with or without UMI. Although UMI is not required, it is strongly recommended. If your FASTQ data has UMI integrated, you can use fastp to shift the UMI to read query names.

The UMI should in the tail of query names. It can have a prefix like UMI, followed by an underscore. If the UMI has a prefix, it should be specified by --umi_prefix or -u. If the UMI prefix is umi or UMI, it can be automatically detected. The UMI can also have two parts, which are connected by an underscore.

UMI examples

Read query name = "NB551106:8:H5Y57BGX2:1:13304:3538:1404:UMI_GAGCATAC", prefix = "UMI", umi = "GAGCATAC"
Read query name = "NB551106:8:H5Y57BGX2:1:13304:3538:1404:umi_GAGC_ATAC", prefix = "umi", umi = "GAGC_ATAC"
Read query name = "NB551106:8:H5Y57BGX2:1:13304:3538:1404:GAGCATAC", prefix = "", umi = "GAGCATAC"
Read query name = "NB551106:8:H5Y57BGX2:1:13304:3538:1404:GAGC_ATAC", prefix = "", umi = "GAGC_ATAC"

all options

options:
  -i, --in                       input sorted bam/sam file. STDIN will be read from if it's not specified (string [=-])
  -o, --out                      output bam/sam file. STDOUT will be written to if it's not specified (string [=-])
  -r, --ref                      reference fasta file name (should be an uncompressed .fa/.fasta file) (string)
  -b, --bed                      bed file to specify the capturing region, none by default (string [=])
  -x, --duplex_only              only output duplex consensus sequences, which means single stranded consensus sequences will be discarded.
      --no_duplex                don't merge single stranded consensus sequences to duplex consensus sequences.
  -u, --umi_prefix               the prefix for UMI, if it has. None by default. Check the README for the defails of UMI formats. (string [=auto])
  -s, --supporting_reads         only output consensus reads/pairs that merged by >= <supporting_reads> reads/pairs. The valud should be 1~10, and the default value is 1. (int [=1])
  -a, --ratio_threshold          if the ratio of the major base in a cluster is less than <ratio_threshold>, it will be further compared to the reference. The valud should be 0.5~1.0, and the default value is 0.8 (double [=0.8])
  -c, --score_threshold          if the score of the major base in a cluster is less than <score_threshold>, it will be further compared to the reference. The valud should be 1~20, and the default value is 6 (int [=6])
  -d, --umi_diff_threshold       if two reads with identical mapping position have UMI difference <= <umi_diff_threshold>, then they will be merged to generate a consensus read. Default value is 1. (int [=1])
  -D, --duplex_diff_threshold    if the forward consensus and reverse consensus sequences have <= <duplex_diff_threshold> mismatches, then they will be merged to generate a duplex consensus sequence, otherwise will be discarded. Default value is 2. (int [=2])
      --high_qual                the threshold for a quality score to be considered as high quality. Default 30 means Q30. (int [=30])
      --moderate_qual            the threshold for a quality score to be considered as moderate quality. Default 20 means Q20. (int [=20])
      --low_qual                 the threshold for a quality score to be considered as low quality. Default 15 means Q15. (int [=15])
      --coverage_sampling        the sampling rate for genome scale coverage statistics. Default 10000 means 1/10000. (int [=10000])
  -j, --json                     the json format report file name (string [=gencore.json])
  -h, --html                     the html format report file name (string [=gencore.html])
      --debug                    output some debug information to STDERR.
      --quit_after_contig        stop when <quit_after_contig> contigs are processed. Only used for fast debugging. Default 0 means no limitation. (int [=0])
  -?, --help                     print this message

citation

The gencore paper has been published in BMC Bioinformatics: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3280-9. If you used gencore in your research work, please cite it as:

Chen, S., Zhou, Y., Chen, Y. et al. Gencore: an efficient tool to generate consensus reads for error suppressing and duplicate removing of NGS data. BMC Bioinformatics 20, 606 (2019) doi:10.1186/s12859-019-3280-9

gencore's People

Contributors

Stargazers

Watchers

gencore's Issues

Extracting UMIs for paired-end data

Hi!

I am trying to use this tool with paired-end data. My UMI are in the 5' extreme of my R2 fastq, so I am extracting them with fastp to their corresponding header. However, R1 fastq do not contain UMI, because they represent the other extreme of the DNA fragment. Do you know if it is neccessary to record UMI in read headers both for R1 and R2 reads? Or the program detects the UMI both for R1 reads because of hit mates in the bam?

Thanks!

gencore seems to have wrongly modified my SAM alignment record

Hi,
Gencore seems to have wrongly modified my SAM alignment record.
I have SAM record as below:
NS500830:1150:HWNN2AFX3:2:11202:24875:15467:GGT_GGC 161 chr7 55268915 60 71M chrUn_gl000220 132375 0 ACTCCAACTTCTACCGTGCCCTGATGGATGAAGAAGACATGGACGACGTGGTGGATGCCGACGAGTACCTC /AEEEAEEEEAEEEEEEEEEEEAEEEAAEEEEEEEEEEEEEEEEE/AEEEAEEEEEEEEEEAEEEE/EEE< NM:i:0 MD:Z:71 MC:Z:71M AS:i:71 XS:i:19 RG:Z:BT2302040049LNCBC NS500830:1150:HWNN2AFX3:2:11202:24875:15467:GGT_GGC 81 chrUn_gl000220 132375 57 71M chr7 55268915 0 TCTCTCTCTCTCTCTCTCTCTGCCTCTCTCACTGTGTCTGTCTTCTGTCTTATTCTCTTTCTCTCTCTGTC EEEEEEAEEEEEEEEEEEEEEEAEEEEEEEEAEAEE/EEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEE NM:i:0 MD:Z:71 MC:Z:71M AS:i:71 XS:i:52 RG:Z:BT2302040049LNCBC
but, after gencore remove duplicate reads using s=2, I got
NS500830:1150:HWNN2AFX3:2:11202:24875:15467:GGT_GGC 161 chr7 55268915 60 71M chrUn_gl000220 132375 0 ACTCCAACTTCTACCGTGCCCTGATGGATGAAGAAGACATGGACGACGTGGTGGATGCCGACGAGTACCTC EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE NM:i:0 MD:Z:71 MC:Z:71M AS:i:71 XS:i:19 RG:Z:BT2302040049LNCBC NS500830:1150:HWNN2AFX3:2:11202:24875:15467:GGT_GGC 145 chrUn_gl000220 132375 25 71M chr7 55268915 0 TCTCTCTCTCTCTCTCTCTCTGCCTCTCTCACTGTGTCTGTCTTCTGTCTTATTCTCTTTCTCTCTCTGTC EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE NM:i:1 MD:Z:7C63 MC:Z:69M2S AS:i:66 XS:i:57 RG:Z:BT2302040049LNCBC XA:Z:chrY,-10024049,71M,4;

why 81 convert to 145?
Thx

running speed very slow for me

Hi - I have some amplicon (with UMI) data (ultra high depth for targeted regions) which I preprocessed with fastp, aligned and now running gencore to dedup. The files have been running for 11 hrs by now (and still running) - is it normal or could there be some error. Any suggestions to speed this up.

about single-read seqence

can I use it on single-read seqence bam data ?

Is it intended action? The read2 flag was converted to read1 flag.

Downloads.zip

I have paired-reads which has 99 (read1) and 147 (read2) flags each.
When I ran the gencore, it was converted to 99 (read1) and 83 (read1).

I didn't use UMI.

Gencore command:
gencore -i TMP.bam -o TMP.unsorted.bam -s 2 -h TMP.html -j TMP.json -r Homo_sapiens_assembly38.fasta -b tmp.bed

In the attached files, please check paired reads named "NDX550236:59:H5VYCBGXF:1:11101:10469:6947".

Data used in the paper?

Hello,

Where can I download the data used in your paper, specifically the ones used for umi-tools and gencore?

Thanks in advance.

Kind regards,

Cedrick

ERROR: Wrong base with value 0

Error message as: "ERROR: Wrong base with value 0", then no output and hanging.

Add option to never toss reads

Hi,

I have noticed that in rare cases, gencore will remove reads from the input entirely. This can cause PE reads to become unpaired and causes problems for certain downstream tools that expect the .bam to contain only paired reads.

Specifically, I think the problem is due to cluster.cpp line 322-324:
if(mostContainedByNum < containedByList.size()*0.4 && containedByList.size() != 1) { return NULL; }

If the major read does not make up at least 40%, no read is selected at all. I think it would be better to always take the best consensus read, choosing randomly if necessary, because I'd rather have slightly imperfect data than no data at all.

Could you add a --nevertoss option, so the user can specify that reads shall never be discarded, even if there is no clear consensus? I have very little experience in C++ and am unable to add this in myself (although commenting out the above and recompiling fixes the problem).

Thanks and best regards,
Florian

Feature request: support deduplication based on [contig]:[R1 coordinate] only

I am glad to see this tool for generating consensus reads. Unfortunately it does not work with some types of data. For example, where library prep chemistry allows for multiple read pairs to be read off the same fragment with differing template lengths (differing R1-R2 spans). One example of such chemistry is Anchored Multiplex PCR: https://www.nature.com/articles/nm.3729

To do this you would cluster on the start position of the R1 but not include the right_pos of the pair. Is this something that could be easily supported via a command line flag?

EOF markers absent

Hello, I'm trying to use gencore as described with the simple examples in the README, and some of the resulting files get the error

[W::bam_hdr_read] EOF marker is absent. The input is probably truncated

when trying to run samtools sort on them.

The process to create the input alignment files for gencore has been:

fastp
bwa mem
samtools sort
samtools index

Some of the files do have a UMI read mismatch in the gencore output. What could be causing that, and do you think that it related to the issue?

change 'nan' in output to 'NaN'

It seems that python json.load cannot handle lowercase 'nan', would it be possible to change the output in json files to 'NaN'?

gencore don't generate consensus read from two pair reads with reciprocal UMI ,such as UMI ATGC_GCAA and GCAA_ATGC

I use the gencore to deal with bam file which has the umi info ,but the output file in some position inconsensus with the description.
the reads pair1 has the umi TGT_CGA and the position chr17:7578090-7578182:

the reads pair2 has the umi CGA_TGT and the same position chr17:7578090-7578182:

but the reads pair1 and pair2 don't generate one consensus reads pair.
the reads pair3 and pair4 has the same situation, as follow:

how can i generate one consensus reads pair

typo

Hi
our packaging system found this typo several times in the README.md file:
consenquently should be consequently

output bam problem

Output bam file appear unsort region, but Iinput file is sorted bam. I have run dozens of sample, and this sample appear this problem.

cmd: gencore -i XYD0160643_L1.raw_data_mapped.bam -o XYD0160643_L1.consensus.bam -r human_g1k_v37.fasta -b target.bed -j HO0330-XYD0160643_L1.json -h HO0330-XYD0160643_L1.html

Segmentation fault (core dumped) after loading contigs from BAM file

I am trying to run gencore on sorted RNA reads mapped with STAR. The input bam file is 708 MB. The reference genome is 22MB.
Here is the command I used to run gencore:

 `gencore -i mapped_reads/R51 -o gencore_bams/R51_gencore.bam -r PlasmoDB-39_PrelictumSGS1-like_Genome.fasta --debug`

It reads in the ref file great, then it lists all the contigs present in the BAM file. Then it crashes:

Starting contig 0
Segmentation fault (core dumped)

I installed gencore using conda.

Mismatched UMI of a pair of reads

hello, I meet some mistake with the gencore (Version: 0.17.2):
1 contigs in the bam file:
chr1: 7249 bp

Mismatched UMI of a pair of reads
Left:
0:0, M:0:0 TLEN:0 ID:0
E100030802L1C001R00101395324:umi_CA_NN
TTTTTTTTTTTTTTTTTTTTTATTTTTTTTTTATTATTAATATTATTATTAATTTTAAAATCACAACAAAATACAAAAAACAAAAAAAAAAA
GGGGGHHGGHIGGGGGHGGGIIGGGGGGGGGGIGGGGHGIHGGGGGGHGGGGGGGGIGGGGGGHGGGHIGIGGGGGIIGGHHGGGGHIGGHG
Right:
0:0, M:0:0 TLEN:0 ID:0
E100030802L1C004R03200085905:umi_CG_NN
TTAAAAAATTATAAAAAAAAAATAAAAAAATAAAAAAAATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
ERROR: The UMI of a read pair should be identical, but we got CA_ and CG_

but, when i find the reads in original bam ,the reads pair seem to be correct with same UMI :

it's seem like the gencore consider the two read with different read name to be one pair read.

the command is:
gencore --umi_prefix=umi -s 3 --ref hg19.fasta --quit_after_contig 25 -i input.bam -o output.umi.bam

but when I add the "-d" parameter,there is no mistake , maybe there has some difference,:
gencore --umi_prefix=umi -d 0 -s 3 --ref hg19.fasta --quit_after_contig 25 -i input.bam -o output.umi.bam

Looking forward to your reply!

gencore-0.17.2 incompatible with python 3.10, python 3.8, python 3.6 or python 2.7

Hi,

I want to use gencore-0.17.2 and downloaded it using conda on Linux:
conda install -c bioconda gencore=0.17.2

However, I tried four versions of python (3.10, 3.8, 3.6, 2.7). none of these is compatible with it, omitting errors including:
Package libstdcxx-ng conflicts for: python=2.7 -> libffi[version='>=3.4,<3.5'] -> libstdcxx-ng[version='>=11.2.0|>=7.5.0'] python=2.7 -> libstdcxx-ng[version='>=7.2.0|>=7.3.0']

Package libgcc-ng conflicts for: python=2.7 -> libffi[version='>=3.4,<3.5'] -> libgcc-ng[version='>=7.5.0'] python=2.7 -> libgcc-ng[version='>=11.2.0|>=7.3.0|>=7.2.0']

Do you have any idea about which version of python I should use?
Thank you in advance!

incorrect bam record sort function which might lead to core dump incidently

incorrect bam record sort logic in https://github.com/OpenGene/gencore/blob/master/src/gencore.h#L33-L36
and fixed by pull request #26
please consider a merge

SAM file validation errors after running gencore

Hello,

After deduplicating with gencore I am getting a number of errors with GATK's ValidateSamFile:

ERROR::MISMATCH_MATE_ALIGNMENT_START, Mate alignment does not match alignment start of mate
ERROR::MISMATCH_FLAG_MATE_NEG_STRAND, Mate negative strand flag does not match read negative strand flag of mate
ERROR::MATES_ARE_SAME_END, Both mates are marked as first of pair
ERROR::MISMATCH_FLAG_MATE_UNMAPPED, Mate unmapped flag does not match read unmapped flag of mate
ERROR::MATE_NOT_FOUND, Mate not found for paired read

I don't get these errors when I run ValidateSamFile on the BAM file pre-deduplication.

Testing gencore with a dummy fastq results in no clusters

Hey @sfchen,

I'm currently testing how gencore handles reads in different case scenarios. For example if gencore can rescue reads with UMI with sequencing errors and or having an N present in the UMI, how it deals with singletons and strand biases (compared to other UMI programs). I made 2 fastq files (read 1 and read 2), that contain 50bp reads (46 genome nt, and 4 UMI nt) that have 25 bp overlap between each read pair. I then used fastp to extract the reads and bwa to align the reads to chr8.

I ran the following command:

gencore -i nov.sorted.bam -o gencore/gencore.bam -r ref/chr8.fa -b NOV.bed -s 1 --umi_prefix = "UMI"

BED:
NOV.bed.txt

Sorted Bam:
nov.sorted.bam.txt

Gencore recognizes that reads are present, but fails to cluster them:

loading reference data:
chr8: 146364022 bp

loaded 1 contigs

1 contigs in the bam file:
chr8: 146364022 bp

----Before gencore processing:
Total reads: 30
Total bases: 1380
Mapped reads: 30 (100.000000%)
Mapped bases: 1380 (100.000000%)
Bases mismatched with reference: 12 (0.869565%)
Reads with mismatched bases: 8 (26.666667%)
Total mapping clusters: 0
Mapping clusters with multiple fragments: 0
Total fragments: 0
Fragments with single-end reads: 0
Fragments with paired-end reads: 0
Duplication level histogram:

----After gencore processing:
Total reads: 0
Total bases: 0
Mapped reads: 0 (nan%)
Mapped bases: 0 (nan%)
Bases mismatched with reference: 0 (nan%)
Reads with mismatched bases: 0 (nan%)
Total mapping clusters: 0
Mapping clusters with multiple fragments: 0
Total fragments: 0
Fragments with single-end reads: 0
Fragments with paired-end reads: 0
Duplication level histogram:

gencore -i nov.sorted.bam -o gencore/gencore.bam -r ref/chr8.fa -b NOV.bed -s 1 --umi_prefix = UMI
gencore v0.13.0, time used: 5 seconds

I was wondering if you could comment on why there is no fragment clustering (in the presence of UMIs and overlap between the reads in the pair)?

genecore will change the MD:Z value in the bam after umi cluster？

Hi,
I want to know if the genecore will change the MD:Z value in the bam after umi cluster? Because the umi cluster correct the mismatch base from artifact.

The software installation is not working properly.

this binary was compiled on CentOS, and tested on CentOS/Ubuntu

wget http://opengene.org/gencore/gencore
chmod a+x ./gencore

I try the above method, but, I can not use this software.

Server report an error:
$ ./gencore
./gencore: error while loading shared libraries: libhts.so.2: cannot open shared object file: No such file or directory

No coverage at the end of the reference after using Gencore

Hi,

I'm using Gencore in an attempt to remove sequencing errors, but i ran into an issue.
After using Gencore i noticed that all the reads covering the end of the reference are removed. Before Gencore there was 2000+ bases coverage, it seems very unlikely that all these reads are duplicates. Could there be an other reason why they are removed? I attached an screenshot of the mapping in IGV. The top is before Gencore and the bottom is after Gencore.

The command i used is:
gencore -i in.bam -o out.bam -r reference.fa -s 1

Thanks in advance.

Behavior relative to picard?

Hi, I separately tried running gencore on a paired-end sequencing run

gencore -i mix_2lines_PE.bam -o gencore_out.bam -r $fasta -s 1

The input .bam file is 738MB. The output bam from picard is 395MB whereas the output from gencore is only 66MB. The data is a relatively low-complexity targeted sequencing sample, but this head-to-head test indicated that gencore is filtering much more than picard. Indeed, the picard bam has about 6 times as many reads when you count compared to the gencore output bam. Any help would be greatly appreciated.

Nanopore or Pacbio HiFi data

Hi @sfchen !

I was wondering if this tool can be used withn Nanopore or Pacbio HiFi data. And with what parameters?

output read cluster information

Hi，@sfchen
Is it possible to output read cluster information? Following is my intention：

Stat which bases were corrected by genecore.
The sequencing error detected by gencore should be highly trusted. And correction detail may provide a solid base ground for modeling sequencing error.
The learned model should be a good guider to eliminate or include mutation when evidence such as supporting reads are rare.

Maybe this idea could be incorporated into gencore to further reduce sequencing error when a cluster size is small. Looking forward to your comment.
Best regards!

report error or?

Hi, the HTML report I get after running gencore shows all 0 numbers after processing (see screenshot below), while the new BAM file looks fine. What's going on?

Mismatched UMI of a pair of reads

Hi,
I'm got an error while generating consensus sequence with gencore (v 0.15.0):

Mismatched UMI of a pair of reads
Left:
1:11184675, M:18:63057755 TLEN:0 ID:0
V300062821L1C003R01400419214:umi_TGGG_CACC M81S13
CATGTCCGTTGCTGCCTGTAAGGAACAGTGGGAGCGGTGAGTGTACATCAGAGGTCCTCAGCTCTTCAGCTGAGCGCACCATCTTAGTCAGGCA
FFFGFFFFFFFFFGFGFFEFFGFFFFFFFFGFFFFFFEFFFFFFFCFFEFFFFFFEFEFFFFFDAFFBGEEGF=,??>GDB8C06F:>FG4894
Right:
1:11184675, M:18:63057755 TLEN:0 ID:0
V300062821L1C004R03300362156:umi_TGGG_CCCC M81S13
CATGTCCGTTGCTGCCTGTAAGGAACAGTGGGAGCGGTGAGTGTACATCAGAGGTCCTCAGCTCTTCAGCTGAGCGCACCATCTTAGTCAGGCA
FFFGFFFFFFFFFGFGFFEFFGFFFFFFFFGFFFFFFEFFFFFFFCFFEFFFFFFEFEFFFFFDAFFBGEEGF=,??>GDB8C06F:>FG4894
ERROR: The UMI of a read pair should be identical, but we got TGGG_CACC and TGGG_CCCC

I'm confused that, the suggestived reads (Left/Right) are actually un-paired, but how could gencore identify them as paired reads?

Here is the command:
gencore --umi_prefix=umi -s 3 --ref hg19.fasta -i test.bam -o out.bam

And this the BAM (please rename to 'test.bam'):

test.bam.gz

Thanks a lot~

Best,
Gerde

'=' symbol in the BAM/CRAM SEQ field

Hi，after I use gencore 0.17.2 to process my bam file, I found '=' symbol in a consensus read sequence as follows:

my command is like this:

please help me out, thnaks!

Insertion/deletion may cause UMI-based collapsing failed

Hi, I found an interesting situation that maybe ignored by Gencore. The following is the grep result of a Gencore bam. As we can see, one UMI (TGTTCCACGCTG ) corresponds to two different records, and the difference is only an insertion of 'A' in reads ：

$ samtools view EPS19F1X2QL11_S34.gencoreDedup.sorted.bam | grep :TGTTCCACGCTG 
chr7:140453219-140453252:34:-:BRAF:NB502121:60:HYCJVAFXY:1:21102:16215:1374:TGTTCCACGCTG	163	chr7	140453037	60	124M	=	140453105	182	AAAAATTTAATCAGTGGAAAAATAGCCTCAATTCTTACCATCCACAAAATGGATCCAGACAACTGTTCAAACTGATGGGACCCACTCCATCGAGATTTCACTGTAGCTAGACCAAAATCACCTA	<EEEEEEEEEEEEEEEEEAEEEEEEEEEEAEEEEEEEEEEEEEEEEEAEEAEE<E/AE<E/EE/EEEEEE/AE<EEEEA<AAEEAEEEEEEE/EE66AAEAAA<EAEEEEAE<E<<A<AEE</A	NM:i:0	MD:Z:124	MC:Z:114M	AS:i:124	XS:i:63	RG:Z:12878QL1
chr7:140453219-140453252:34:-:BRAF:NB502121:60:HYCJVAFXY:1:11101:22463:18229:TGTTCCACGCTG	163	chr7	140453038	60	124M	=	140453105	181	AAAATTTAATCAGTGGAAAAATAGCCTCAATTCTTACCATCCACAAAATGGATCCAGACAACTGTTCAAACTGATGGGACCCACTCCATCGAGATTTCACTGTAGCTAGACCAAAATCACCTAT	EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE	NM:i:0	MD:Z:124	MC:Z:114M	AS:i:124	XS:i:64	RG:Z:12878QL1
chr7:140453219-140453252:34:-:BRAF:NB502121:60:HYCJVAFXY:1:21102:16215:1374:TGTTCCACGCTG	83	chr7	140453105	60	114M	=	140453037	-182	AAACTGATGGGACCCACTCCATCGAGATTTCACTGTAGCTAGACCAAAATCACCTATTTTTACTGTGAGGTCTTCATGAAGAAATATATCTGAGGTGTAGTAAGTAAAGGAAAA	AAEA<<<EAEAA<AAAAEAEAA<A/E<EEEEE/EEEEA<EEEEEEE/<EEEAEEAEE6/EEEAEE<EEEEEEEEEEEEEA<AEEEA//EEEE/EEEE6EEEEEEEEEEEEEA/E	NM:i:0	MD:Z:114	MC:Z:124M	AS:i:114	XS:i:64	RG:Z:12878QL1
chr7:140453219-140453252:34:-:BRAF:NB502121:60:HYCJVAFXY:1:11101:22463:18229:TGTTCCACGCTG	83	chr7	140453105	60	114M	=	140453038	-181	AAACTGATGGGACCCACTCCATCGAGATTTCACTGTAGCTAGACCAAAATCACCTATTTTTACTGTGAGGTCTTCATGAAGAAATATATCTGAGGTGTAGTAAGTAAAGGAAAA	EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE	NM:i:0	MD:Z:114	MC:Z:124M	AS:i:114	XS:i:64	RG:Z:12878QL1

With best regards

Single-end read with UMI

Does gencore work with data like this, which is common in scRNA-seq?

My read names look like this: A00439:244:HFHN5DRXX:1:2139:21052:22357_CGAGAGGTACCTGTGAATTG

where the first part of the string is what the sequencer provides; what follows the underscore is the UMI.

I've tried running the following:

gencore -i mix_for_gencore.bam -o gencore_dedup.bam -r $fasta

and

gencore -i mix_for_gencore.bam -o gencore_dedup.bam -r $fasta -u _

but neither option results in a filtering of any reads (i.e. the output bam is the same size).

Settings for liquid biopsy?

Can you suggest specific settings for liquid biopsy data with UMIs?

Gencore failed to process BAM generated by STAR aligner

Hi, I tried to use gencore to generate consensus reads from BAM file produced from STAR aligner but it failed. Could you solve this problem？ Many thanks.

请教个问题

输入排序的bam，可以是不用比对的bam文件吗，比如使用picard的fastqtobam工具得到的未比对的bam文件

There are one or three reads with the same name after deduplication

after deduplication, they are paired read according to the flag value, but there are one or three reads with the same name. Why does this happen?

Reads with identical UMIs and mapping coordinates output by Gencore

Hello,

I am running the latest gencore with --no_duplex and -s = 3. However, I am still getting many reads output which have the same UMI and identical mapping coordinates. Please see the IGV screenshot below. Reads are shown as pairs and are grouped by UMI, which can be seen on the left.

write read groups

I have not found any options to write read groups while a lot of downstream tools require it. That means that I cannot fully drop picard and substitute it with gencore, because of this problem.

Regression issue

Collapsing fails

I'm having trouble when using gencore to collapse reads (with version 0.14.0 )

I have used this tools for long time and this is the first time that I have encountered there is no collapsed read.

The original BAM file IGV image says there must exist collapsed reads as shown in IGV below

It seems there is an error when I see the log file and says

----Before gencore processing:
Total reads: 4014
Total bases: 581400
Mapped reads: 4014 (100.000000%)
Mapped bases: 581400 (100.000000%)
Bases mismatched with reference: 643 (0.110595%)
Reads with mismatched bases: 409 (10.189337%)
Total mapping clusters: 0
Mapping clusters with multiple fragments: 0
Total fragments: 0
Fragments with single-end reads: 0
Fragments with paired-end reads: 0
Duplication level histogram: 

----After gencore processing:
Total reads: 0
Total bases: 0
Mapped reads: 0 (Segmentation fault (core dumped)

Is there any reason for this problem?

running speed is too slow, only using one CPU

the reference genome has 26 chrs, 2.4G size. And the sequence coverage is about 100X.

Currently gencore keeps running for >15 hours and still doesn't stop.

I found that gencore uses only one CPU all the time, and I think whether it would be better to change to multi-workers in the future.

UMI in Sam Tag

I would like to request support for processing UMI data that is contained in a sam tag (e.g. the UB tag in 10x data-- https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/bam). This is supported in Picard. https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.7.0/picard_sam_markduplicates_UmiAwareMarkDuplicatesWithMateCigar.php

I understand the temporary alternative is to modify my read names, but it would be really nice if this could be included as a feature.

output bam without FR or RR tag

Hi, I used this software with the cmd: gencore -i xx.fastp.sort.bam -o xx.gencore.bam -r hg19.fasta -s 1 --high_qual 30. However, there is no information about FR or RR tag in the output bam, is there something wrong with my cmd? Can you give me some suggestions? thanks so much.

how to set umi mismatch and umi cluster method?

Hi, when i using gencore, can I set umi mismatch or the clustering method like 'directional adjacency' in umi-tools?

How to set up the base quality of a concensus read?

Hello,
Some SNP calling algorithms use the base quality, could you give a brief description about how to set up the base quality of consensus read?

合并umi问题

hi，你好
我用的umi数据，大概是下面这样。感觉从IGV图来看合并重复序列的时候没有用上umi呢，不然不应该有下面IGV图里面那些分散的低质量C突变。是不是我的使用方法有问题？

下面是gencore代码

/mnt/nfs/zhujiaqi/test/umi/gencore --umi_prefix umi --ref /mnt/nfs/database/hg19/reference/ucsc.hg19.fasta -b /mnt/nfs/zhujiaqi/test/test.bed -i /mnt/nfs/zhujiaqi/test/umi/test1.sort.bam -o /mnt/nfs/zhujiaqi/test/umi/test1.umi.bam -s 3 --score_threshold 8

下面是IGV图

Are Unmapped reads in the BAM file corrected ? or discarded ?

in overlap position, gencore doesn't reset quality

hi ,
in the document : In the overlapped region, if a base and its pair are mismatched, its quality score will be adjusted to: max(0, this_qual - pair_qual), but i find the quality score doesn't reset,

the 134G8 means G>T in the first read, and the second read is still G, the quality of this position are both D(Q35), so G>T mutation is called by vardict since the quality seems good.

SV calling

I intend to detect structural variation in UMI data, So want to use gencore to process the raw data,so,could you please elaborate on how gencore deals with split reads and discordant read-pairs? And ,how does it deal with read-pairs with one read unmapped, and soft-clipped bases of reads?

Thank you!
tang

what is the max number of interval in a bed for QC report

I tried to generated the QC report with coverage statistic in bed scale,
When I used a smaller bed, such as the example one, I can get the nice plot.
However, when I used a much larger one (such as the WES), the bed plot fail to show up (genome scale plot are unaffected)

Prebuild versions of gencore in opengene.org with version history

i use gencore in my pipeline, and i would like to know if its possible to manage versions like your fastp tool in the opengene.org website ?
a version control of gencore will be very helpful !
thks

opengene / gencore Goto Github PK

gencore's Introduction

gencore

what's gencore?

take a quick glance of the informative report

try gencore to generate above reports

quick examples

get gencore

install with Bioconda

download binary

or compile from source

why to use gencore?

original BAM

gencore processed BAM

QC result reported by gencore

coverate statistics in genome scale

coverate statistics in capturing regions

understand the output

how it works

the quality thresholds

the scoring

command examples

UMI format

UMI examples

all options

citation

gencore's People

Contributors

Stargazers

Watchers

Forkers

gencore's Issues

this binary was compiled on CentOS, and tested on CentOS/Ubuntu

I try the above method, but, I can not use this software.

Recommend Projects

Recommend Topics

Recommend Org