statgen / bamutil Goto Github PK

Home Page: http://genome.sph.umich.edu/wiki/BamUtil

Makefile 0.38% C++ 79.80% Shell 19.49% CMake 0.33%

bamutil's Introduction

Dependencies
------------

On debian type systems (including Ubuntu), add the following packages if they are not already installed (or have your admin add them if you do not have permission):

sudo apt-get install g++ libssl-dev libcurses-perl zlib1g-dev

Building
--------

To compile, from the top level directory, type: make
To test (after compiling), from the top level directory, type: make test

Under the main statgen repository, there are: 
* lib - the library code
** The library code is compiled into libStatGen.a which, after compiling is located at: statgen/lib/libStatGen.a
** After compiling, library headers can all be found in: statgen/lib/include
* src - the tools we developed
** After compling, the executables are found in: statgen/src/bin
* scripts - the scripts we developed

Makefiles
---------
statgen/Makefile.include should contain the definitions that you need for creating software using this library.

statgen/lib/Makefile.lib and statgen/lib/Makefile.src can be used as templates for creating Makefiles for new software.  If possible, just link to them.  They look for a file called Makefile.tool that should be all you need to update for your specific software.  (both Makefiles automatically include Makefile.include)

A similar setup should be used for test code, linking your Makefile to statgen/Makefile.test and defining Makefile.testtool for your specific test.

Other Notes
-----------
* Typically the .o files are compiled into their own directory called obj.

bamutil's People

Contributors

Stargazers

Watchers

bamutil's Issues

bamutil diff - Get record for every read being compared

Hi, I am wondering if I am using the tool correctly. Will there be a record for every QNAME in the bams being compared when comparing MAPQ scores?
I am noticing my diff output includes only a fraction of the QNAME entries of one of my bams being diff'd.

$ bam diff --in1 ${BAM1} --in2 ${BAM2} --mapQual >> diffs.txt
$ cat diffs.txt | grep -v "<" | grep -v ">" | sort | uniq | wc -l.  # get only the read names, no diffs
7199
$ samtools view ${BAM1} | cut -f1 | sort | uniq | wc -l.            # qname/read names in my input bam
46367516

I am trying to determine whether one map/align method mapped reads with greater confidence than another. Thank you for any direction, correction, or help. Thanks!

BamUtil: recab error: failed to open reference genome

Hello,
I am trying to use the TOPMed Variant Calling Pipeline (latest: Freeze 8) to analyze human genetic data. I am following the instructions on the Readme (https://github.com/statgen/topmed_variant_calling) and have aligned my sequenced reads in BAM format (and indexed them) using bwa to the GRCh38 (full analysis set plus decoy hla) reference genome fasta. (Found here: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/).

The TOPMed pipeline readme does not mention the steps of deduplicating and recalibrating the BAM files, but as mentioned in the wiki here (https://genome.sph.umich.edu/wiki/GotCloud) and my own knowledge of variant calling pipeline best practices, I believe these steps are highly recommended.

So I would like to use BamUtil to deduplicate and recalibrate my BAM files. Following the instructions here (https://genome.sph.umich.edu/wiki/BamUtil:_recab), my command is:

bamUtil/bin/bam dedup --recab --in input.bam --out input.bam.recab --force --refFile /ref/hg38/GRCh38_full_analysis_set_plus_decoy_hla.fa --dbsnp /ref/dbsnp_142.b38.vcf.gz --oneChrom --storeQualTag OQ --maxBaseQual 40

But it fails with the following error message:

Failed to open reference genome /ref/hg38/GRCh38_full_analysis_set_plus_decoy_hla.fa /ref/hg38/GRCh38_full_analysis_set_plus_decoy_hla-bs.umfa: wrong type of file (expected type 460927905 but got 1096302936)

I also tried the same command as above but using a similar(?) reference genome (hs38DH.fa), which I obtained (following the instructions here: https://github.com/statgen/topmed_variant_calling) using the command:
wget ftp://share.sph.umich.edu/1000genomes/fullProject/hg38_resources/topmed_variant_calling_example_resources.tar.gz

This returned a similar error:

GenomeSequence::open: failed to open file /ref/hs38DH-bs.umfa Failed to open reference genome /ref/hs38DH.fa hs38DH-bs.umfa: wrong type of file (expected type 460927905 but got 1415074904)

I am not sure what it means by "expected type 460927905". I would greatly appreciate any help in solving this issue. Thank you!

clipOverlap in outward facing pairs

Hello,

First of all, thanks for the bamUtil suite!

I'd like to understand the following behavior of clipOverlap. I have a read pair marked as properly paired even if reads face outwards:

    <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< 
                        >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

When I clip them (command below) I see that the bottom read is clipped on the right and the top read is fully clipped. Result:

    ................................................... 
                        >>>>>>>>>>>>>>>>>>>>>....................

What is the reason for doing this? I would expect one of the two reads to be clipped in the overlapping region only. (Granted such pair is odd anyway but I do find these cases in bwa output).

bam clipOverlap --in outies.sam --out out.sam --readName

Input/Output:

cat outies.sam
@SQ	SN:chr7	LN:159138663
r1	99	chr7	100	255	40M	=	80	0	NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN	*
r1	147	chr7	80	255	40M	=	100	0	NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN	*


cat out.sam
@SQ	SN:chr7	LN:159138663
r1	99	chr7	100	255	20M20S	=	80	0	NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN	*
r1	147	chr7	80	255	40S	=	100	0	NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN	*

Thank you!
Dario

bam2FastQ: fixing output file name whith --gzip

Dear all,

I've been using bam2FastQ (latest version) and noticed that when the --gzip option is set, even though the files are gzipped, the file names are missing the .gz extension. I've tweaked the code to fix this (see attached).

cheers,

Bam2FastQ.cpp.zip

bash -x testBam2FastQ.sh and bash -x testMergeBam.sh failed on centos8_aarch64

Hello,I meet a problem:bash -x testBam2FastQ.sh and bash -x testMergeBam.sh failed on centos8_aarch64

aarch64:
[root@localhost test]# bash -x testMergeBam.sh
+ ERROR=false
+ ../bin/bam mergeBam --out results/mergeBam.bam --list testFiles/mergeBam.list --noph
Unrecognized option nophonehome
+ '[' 255 -ne 0 ']'
+ ERROR=true
+ diff results/mergeBam.bam expected/mergeBam.bam
diff: results/mergeBam.bam: No such file or directory
+ '[' 2 -ne 0 ']'
+ ERROR=true
+ ../bin/bam mergeBam -o results/mergeSam.sam -l testFiles/mergeSam.list --noph
Unrecognized option nophonehome
+ '[' 255 -ne 0 ']'
+ ERROR=true
+ diff results/mergeSam.sam expected/mergeSam.sam
diff: results/mergeSam.sam: No such file or directory
+ '[' 2 -ne 0 ']'
+ ERROR=true
+ ../bin/bam mergeBam -o results/mergeSam1.sam -i testFiles/sortedSam.sam -i testFiles/sortedSam1.sam --noph
Unrecognized option nophonehome
+ '[' 255 -ne 0 ']'
+ ERROR=true
+ diff results/mergeSam1.sam expected/mergeSam1.sam
diff: results/mergeSam1.sam: No such file or directory
+ '[' 2 -ne 0 ']'
+ ERROR=true
+ ../bin/bam mergeBam -o results/mergeSam2.sam -i testFiles/sortedSam1.sam -i testFiles/sortedSam.sam --noph
Unrecognized option nophonehome
+ '[' 255 -ne 0 ']'
+ ERROR=true
+ diff results/mergeSam2.sam expected/mergeSam2.sam
diff: results/mergeSam2.sam: No such file or directory
+ '[' 2 -ne 0 ']'
+ ERROR=true
+ ../bin/bam mergeBam -o results/mergeSam3.sam -i testFiles/sortedSam.sam -i testFiles/sortedSam.sam --noph
Unrecognized option nophonehome
+ '[' 255 -ne 0 ']'
+ ERROR=true
+ diff results/mergeSam3.sam expected/mergeSam3.sam
diff: results/mergeSam3.sam: No such file or directory
+ '[' 2 -ne 0 ']'
+ ERROR=true
+ ../bin/bam mergeBam -o results/mergeSam4.sam -i testFiles/sortedSam.sam -i testFiles/sortedSam2.sam --noph
+ '[' 255 -eq 0 ']'
+ '[' -e results/mergeSam4.sam ']'
+ diff results/mergeSam4.log expected/mergeSam4.log
1c1,11
< Unrecognized option nophonehome
---
> ERROR : Failed to add readgroup to header, duplicate, but non-identical RG ID, myID2
> prev: (@RG    ID:myID2        SM:sample2      LB:library2
> )
> new:  (@RG    ID:myID2        SM:sample24     LB:library2
> )
> Exiting due to ERROR:
>       ERROR: Failed to add readgroup to header, duplicate, but non-identical RG ID, myID2
> prev: (@RG    ID:myID2        SM:sample2      LB:library2
> )
> new:  (@RG    ID:myID2        SM:sample24     LB:library2
> )
+ '[' 1 -ne 0 ']'
+ ERROR=true
+ ../bin/bam mergeBam -o results/mergeSamPI.sam -i testFiles/sortedSam.sam -i testFiles/sortedSamPI.sam --ignorePI --noph
Unrecognized option nophonehome
+ '[' 255 -ne 0 ']'
+ ERROR=true
+ diff results/mergeSamPI.sam expected/mergeSam3.sam
diff: results/mergeSamPI.sam: No such file or directory
+ '[' 2 -ne 0 ']'
+ ERROR=true
+ ../bin/bam mergeBam -o results/mergeSamPI4.sam -i testFiles/sortedSamPI.sam -i testFiles/sortedSam2.sam -I --noph
+ '[' 255 -eq 0 ']'
+ '[' -e results/mergeSamPI4.sam ']'
+ diff results/mergeSamPI4.log expected/mergeSam4PI.log
1c1,11
< Unrecognized option nophonehome
---
> ERROR : Failed to add readgroup to header, duplicate, but non-identical RG ID, myID2, even when ignoring PI
> prev: (@RG    ID:myID2        SM:sample2      PI:500  LB:library2
> )
> new:  (@RG    ID:myID2        SM:sample24     LB:library2
> )
> Exiting due to ERROR:
>       ERROR: Failed to add readgroup to header, duplicate, but non-identical RG ID, myID2, even when ignoring PI
> prev: (@RG    ID:myID2        SM:sample2      PI:500  LB:library2
> )
> new:  (@RG    ID:myID2        SM:sample24     LB:library2
> )
+ '[' 1 -ne 0 ']'
+ ERROR=true
+ ../bin/bam mergeBam -o results/mergeSamPI1.sam -i testFiles/sortedSamPI1.sam -i testFiles/sortedSamPI.sam -I --noph
Unrecognized option nophonehome
+ '[' 255 -ne 0 ']'
+ ERROR=true
+ diff results/mergeSamPI1.sam expected/mergeSamPI1.sam
diff: results/mergeSamPI1.sam: No such file or directory
+ '[' 2 -ne 0 ']'
+ ERROR=true
+ ../bin/bam mergeBam -o results/mergeSamBeforePIfail.sam -i testFiles/sortedSamPI.sam -i testFiles/sortedSamPI2.sam -I --noph
+ '[' 255 -eq 0 ']'
+ '[' -e results/mergeSamBeforePIfail.sam ']'
+ diff results/mergeSamBeforePIfail.log expected/mergeSamPI_2.log
1c1,11
< Unrecognized option nophonehome
---
> ERROR : Failed to add readgroup to header, duplicate, but non-identical RG ID, myID2, even when ignoring PI
> prev: (@RG    ID:myID2        SM:sample2      PI:500  LB:library2
> )
> new:  (@RG    ID:myID2        SM:sample22     PI:500  LB:library2
> )
> Exiting due to ERROR:
>       ERROR: Failed to add readgroup to header, duplicate, but non-identical RG ID, myID2, even when ignoring PI
> prev: (@RG    ID:myID2        SM:sample2      PI:500  LB:library2
> )
> new:  (@RG    ID:myID2        SM:sample22     PI:500  LB:library2
> )
+ '[' 1 -ne 0 ']'
+ ERROR=true
+ ../bin/bam mergeBam -o results/mergeSamAfterPIfail.sam -i testFiles/sortedSamPI.sam -i testFiles/sortedSamPI3.sam -I --noph
+ '[' 255 -eq 0 ']'
+ '[' -e results/mergeSamAfterPIfail.sam ']'
+ diff results/mergeSamAfterPIfail.log expected/mergeSamPI_3.log
1c1,11
< Unrecognized option nophonehome
---
> ERROR : Failed to add readgroup to header, duplicate, but non-identical RG ID, myID2, even when ignoring PI
> prev: (@RG    ID:myID2        SM:sample2      PI:500  LB:library2
> )
> new:  (@RG    ID:myID2        SM:sample2      PI:500  LB:library22
> )
> Exiting due to ERROR:
>       ERROR: Failed to add readgroup to header, duplicate, but non-identical RG ID, myID2, even when ignoring PI
> prev: (@RG    ID:myID2        SM:sample2      PI:500  LB:library2
> )
> new:  (@RG    ID:myID2        SM:sample2      PI:500  LB:library22
> )
+ '[' 1 -ne 0 ']'
+ ERROR=true
+ ../bin/bam mergeBam -o results/mergeSamPIfail.sam -i testFiles/sortedSamPI.sam -i testFiles/sortedSamPI1.sam --noph
+ '[' 255 -eq 0 ']'
+ '[' -e results/mergeSamPIfail.sam ']'
+ diff results/mergeSamPIfail.log expected/mergeSamPIfail.log
1c1,11
< Unrecognized option nophonehome
---
> ERROR : Failed to add readgroup to header, duplicate, but non-identical RG ID, myID2
> prev: (@RG    ID:myID2        SM:sample2      PI:500  LB:library2
> )
> new:  (@RG    ID:myID2        SM:sample2      PI:501  LB:library2
> )
> Exiting due to ERROR:
>       ERROR: Failed to add readgroup to header, duplicate, but non-identical RG ID, myID2
> prev: (@RG    ID:myID2        SM:sample2      PI:500  LB:library2
> )
> new:  (@RG    ID:myID2        SM:sample2      PI:501  LB:library2
> )
+ '[' 1 -ne 0 ']'
+ ERROR=true
+ ../bin/bam mergeBam -o results/mergeBamReg1List.sam -l testFiles/mergeBam.list --noph -r 1
Unrecognized option nophonehome
+ '[' 255 -ne 0 ']'
+ ERROR=true
+ diff results/mergeBamReg1List.sam expected/mergeBamReg1.sam
diff: results/mergeBamReg1List.sam: No such file or directory
+ '[' 2 -ne 0 ']'
+ ERROR=true
+ ../bin/bam mergeBam -o results/mergeBamReg1File.sam -l testFiles/mergeBam.list --noph -R testFiles/region1.txt
Unrecognized option nophonehome
+ '[' 255 -ne 0 ']'
+ ERROR=true
+ diff results/mergeBamReg1File.sam expected/mergeBamReg1.sam
diff: results/mergeBamReg1File.sam: No such file or directory
+ '[' 2 -ne 0 ']'
+ ERROR=true
+ ../bin/bam mergeBam -o results/mergeBamReg13List.sam -l testFiles/mergeBam.list --noph -r 1,3
Unrecognized option nophonehome
+ '[' 255 -ne 0 ']'
+ ERROR=true
+ diff results/mergeBamReg13List.sam expected/mergeBamReg13.sam
diff: results/mergeBamReg13List.sam: No such file or directory
+ '[' 2 -ne 0 ']'
+ ERROR=true
+ ../bin/bam mergeBam -o results/mergeBamReg13File.sam -l testFiles/mergeBam.list --noph -R testFiles/region13.txt
Unrecognized option nophonehome
+ '[' 255 -ne 0 ']'
+ ERROR=true
+ diff results/mergeBamReg13File.sam expected/mergeBamReg13.sam
diff: results/mergeBamReg13File.sam: No such file or directory
+ '[' 2 -ne 0 ']'
+ ERROR=true
+ ../bin/bam mergeBam -o results/mergeBamReg1P3List.sam -l testFiles/mergeBam.list --noph -r 1:75-1011,1:1750-1750,3
Unrecognized option nophonehome
+ '[' 255 -ne 0 ']'
+ ERROR=true
+ diff results/mergeBamReg1P3List.sam expected/mergeBamReg1P3.sam
diff: results/mergeBamReg1P3List.sam: No such file or directory
+ '[' 2 -ne 0 ']'
+ ERROR=true
+ ../bin/bam mergeBam -o results/mergeBamReg1P3File.sam -l testFiles/mergeBam.list --noph -R testFiles/region1P3.txt
Unrecognized option nophonehome
+ '[' 255 -ne 0 ']'
+ ERROR=true
+ diff results/mergeBamReg1P3File.sam expected/mergeBamReg1P3.sam
diff: results/mergeBamReg1P3File.sam: No such file or directory
+ '[' 2 -ne 0 ']'
+ ERROR=true
+ ../bin/bam mergeBam -o results/mergeBamReg1PList.sam -l testFiles/mergeBam.list --noph -r 1:1014
Unrecognized option nophonehome
+ '[' 255 -ne 0 ']'
+ ERROR=true
+ diff results/mergeBamReg1PList.sam expected/mergeBamReg1P.sam
diff: results/mergeBamReg1PList.sam: No such file or directory
+ '[' 2 -ne 0 ']'
+ ERROR=true
+ ../bin/bam mergeBam -o results/mergeBamReg1PFile.sam -l testFiles/mergeBam.list --noph -R testFiles/region1P.txt
Unrecognized option nophonehome
+ '[' 255 -ne 0 ']'
+ ERROR=true
+ diff results/mergeBamReg1PFile.sam expected/mergeBamReg1P.sam
diff: results/mergeBamReg1PFile.sam: No such file or directory
+ '[' 2 -ne 0 ']'
+ ERROR=true
+ ../bin/bam mergeBam -o results/mergeErr.sam -l testFiles/mergeBam.list --noph -r 1:1r-5
Unrecognized option nophonehome
+ '[' 255 -eq 0 ']'
+ diff results/mergeErr1.err expected/mergeErr1.err
0a1,4
>
> FATAL ERROR -
> Error: Invalid region string, '1:1r-5', the start position, '1r', is not an integer >= 1.
>
+ '[' 1 -ne 0 ']'
+ ERROR=true
+ ../bin/bam mergeBam -o results/mergeBamReg1File.sam -l testFiles/mergeBam.list --noph -R testFiles/region1err.txt
Unrecognized option nophonehome
+ '[' 255 -eq 0 ']'
+ diff results/mergeErr2.err expected/mergeErr2.err
0a1,4
>
> FATAL ERROR -
> Error: Invalid region string, '1:4-5k', the end position, '5k', is not an integer >= 1.
>
+ '[' 1 -ne 0 ']'
+ ERROR=true
+ true == true
+ echo 'Failed testMergeBam.sh'
Failed testMergeBam.sh
+ exit 1

However, the two test files on ccentos8_x86_64 have passed. Why is this the case? Is aarch64 not supported?

No output

I sometimes bump into this problem with bam diff tool, it seems the comparison does not begin. unfortunately i couldn't upload bam/sam files here.

Failed to open bam fille

Hi, I am a newbie in Github and analyzing NGS data using command line, I hope I am writing this question in the right place. i am trying to run clipoverlap on some bam files. I get an error:
FAIL_IO: Failed to Open "THE BAM FILE NAME".
The directory if right and I can not figure out why I am getting this error.

Generating interleaved FASTQ for bam2FastQ?

Any chance of adding the option to bam2FastQ to generate an interleaved FASTQ file for paired-end reads? Currently it defaults to 2 FASTQ files. Some tools (e.g. bwa) will handle this form of data, e.g. the "-p" option for the "bwa mem" command (http://bio-bwa.sourceforge.net/bwa.shtml). It also makes it a lot easier to pipe the output from bam2FastQ directly into bwa without writing intermediate files. If it's not a feature that's currently planned, I'd be willing to look into what would be required to create a patch. Thanks!

bamutil diff speed ups?

I am trying to generate 2 bam files that are the differences between 2 bams using bamutil diff.
http://genome.sph.umich.edu/wiki/BamUtil:_diff#Usage

After trying it for a while, I find it's exaclty what I need, but it's taken a long time (more than 10 hours and running) to compare two human 30x builds:

~/src/bamUtil/bin/bam diff --mapQual --onlyDiffs --recPoolSize 100000 --posDiff 100000 --in1 $bam1 --in2 $bam2 --out onlyDiffs.bam

Any ideas what I can do to speed it up?

the clang compiler does not support -pg option on versions of OS X 10.9 and later

Hi there,

I was trying to install bamUtil.

However, it shows the following error:

clang: error: the clang compiler does not support -pg option on versions of OS X 10.9 and later
make[1]: *** [../bin/profile/bam] Error 1
make: *** [src] Error 2

I wonder what should I do?

Thank you!

Assumption made in code masked by earlier check.

This is super low priority but the code here assumes that the first read is the forward read and the second read is the reverse. Everywhere that this function is called ensures this but any new code that calls this function may not. This also assumes that the reverse read does not extend past the beginning of the forward read. This is dependent on the sequencing technique and may not be future proof.

Update wiki build insturctions

Build instructions on http://genome.sph.umich.edu/wiki/BamUtil vs GitHub are different.

‘DATE’ / 'USER' was not declared in this scope

While compiling bamUtil v1.0.9 using the following command:

make USER_COMPILE_VARS=-mmacosx-version-min=10.8

I ran into this error:

g++  -O4 -pipe -Wall  -I../../libStatGen/include -I.   -D__ZLIB_AVAILABLE__ -D_FILE_OFFSET_BITS=64 -D__STDC_LIMIT_MACROS -mmacosx-version-min=10.8 -o ../obj/BamExecutable.o -c  BamExecutable.cpp -DVERSION="\"1.0.9\""
BamExecutable.cpp: In static member function ‘static void BamExecutable::bamVersion()’:
BamExecutable.cpp:41: error: ‘DATE’ was not declared in this scope
BamExecutable.cpp:41: error: ‘USER’ was not declared in this scope

Using this command to compile allowed me to proceed:

make "USER_COMPILE_VARS=-mmacosx-version-min=10.8 -DUSER=\"\\\"lfo\\\"\" -DDATE=20150215"

Building bamUtil returns error: call of overloaded 'to_string(int&)' is ambiguous

Hi there,
I am encountering an error while building this repository using the following codes:

git clone git://github.com/statgen/bamUtil.git
cd bamUtil
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
module load gcc/v6.3.0
make cloneLib
make ## here comes the error
...
VcfRecord.cpp: In member function 'bool VcfRecord::read(InputFile*, bool, VcfRecordDiscardRules&, VcfSubsetSamples*)':
VcfRecord.cpp:160: error: call of overloaded 'to_string(int&)' is ambiguous
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2604: note: candidates are: std::string std::to_string(long long int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2610: note: std::string std::to_string(long long unsigned int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2616: note: std::string std::to_string(long double)
VcfRecord.cpp: In member function 'bool VcfRecord::write(InputFile*, bool)':
VcfRecord.cpp:199: error: call of overloaded 'to_string(int&)' is ambiguous
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2604: note: candidates are: std::string std::to_string(long long int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2610: note: std::string std::to_string(long long unsigned int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2616: note: std::string std::to_string(long double)
make[3]: *** [obj/VcfRecord.o] Error 1
make[2]: *** [vcf] Error 2
make[1]: *** [../../libStatGen/libStatGen.a] Error 2
make[1]: Leaving directory `/proj/data9/samuel/modules/bamUtil/src'
make: *** [src] Error 2

I am working on a HPC cluster that uses linux
Any suggestion to avoid the ambiguities?

compare reads ignoring read id

Hi,

I would like to compare two bam files ignoring the read ids: if two reads contain the same sequence string but different read ids, I would like bam diff to consider them the same, and skip them in the _only1 and _only2 files when running bam diff --in1 bam1.bam --in2 bam2.bam --out my.bam

I have modified the code below, but I would like to know if this should be enough or should be combined with specific parameters when running bam diff.

Thanks in advance.

bam diff matches records by different parameters and also by

ReadName. I modified the code so that it doesnt look at the readname

when doing the comparison

cd /home/avilella/src/bamUtil/src

~/src/bamUtil/src/Diff.cpp

Around line 463

bool Diff::matchingRecs(SamRecord* rec1, SamRecord* rec2)
{
if((rec1 == NULL) || (rec2 == NULL))
{
// one or both of the records is NULL, so return false.
return(false);
}

// // Have 2 records, so compare them.
// if((SamFlag::getFragmentType(rec1->getFlag()) == 
//     SamFlag::getFragmentType(rec2->getFlag())) &&
//    (strcmp(rec1->getReadName(),
//            rec2->getReadName()) == 0))
// Have 2 records, so compare them.
if(SamFlag::getFragmentType(rec1->getFlag()) == 
    SamFlag::getFragmentType(rec2->getFlag()))
{
    // Same fragment and read name.
    return(true);
}
return(false);

}

bam dedup --in /dev/stdin --out /dev/stdout does not work.

I want to use bam dedup in a pipe to save disk space and accesses:

$ cat input.sbam | bam dedup --in - --out - --log - > output1.sam
terminate called after throwing an instance of 'std::runtime_error'
what(): FAIL_PARSE: Too few columns (1) in the Record, expected at least 11.
Aborted

The same happens when I use /dev/stdin instead of '-'.

Furthermore, the output of the above command is sam formatted and in unsorted order. If I compare the bam formatted and sorted output of the previous command with the output of the command using an output file instead of stdout, I get more or less the same results:

$ bam dedup --in input.sbam --out output2.sbam --log -
$ samtools view -S -b output1.sam | samtools sort - output1.sbam
$ md5sum <(samtools view outpu1.bam | cut -f -11) <(samtools view output2.sbam | cut -f -11)
f026615da44eee232c9e560635dba161 /dev/fd/63
f026615da44eee232c9e560635dba161 /dev/fd/62

The cut -f 11 is to remove the attributes column because the two invocations produce different attribute order.

It would be desirable if output to standard out would produce sorted bam.

Philip

`bam diff` seems to return the reads that look identical between A and B

Hi,

I have two BAM files that I'd like to compare. Each is about 5.6GB. I expect them to be identical (I'm sort of doing a reproducibility test).

When I ran with the following command:

bam diff --in1 a.bam --in2 b.bam --all --onlyDiffs --recPoolSize -1 --out c.bam

It generated three files:

-rw-r--r-- 1    2373574 Jun 13 15:12 c.bam
-rw-r--r-- 1    1803478 Jun 13 15:12 c_only1_a.bam
-rw-r--r-- 1    1803105 Jun 13 15:12 c_only2_b.bam

I tried to see what actually differs between the two, but I think they look identical. My suspicion is maybe something to do with the muti-mapped reads. Do you have any idea how to resolve this?

samtools view c.bam | head -n1
:TAGATCGCAAATGGTA:CTTATCAAGCGC:;A00228:279:HFWFVDMXX:1:1229:8477:28792	0	1	629088	1	46M1I44M	*	0	0	GTCCGAACTAGTATCAGGCTTCAAAATCGAATACGCCGCAGGCCCCCTTCGCCCTATTCTTCATAGCAGAATACACAAACATTATTATAAT	,:FF,FFFFFF:,,FF:FFFFFFF,F::FFF,FF,FF:FFFF:FFF,:F:F,FFFFFF,F,FF:FFF,FFF:FFFFFFFFFFFFF:FFFFF	AS:i:78	HI:i:4	NH:i:4	nM:i:3	ZC:Z:42M1I48M	ZT:Z:AS:i:78;HI:i:3;NH:i:4;nM:i:3

$ samtools view a.bam | grep -F ":TAGATCGCAAATGGTA:CTTATCAAGCGC:;A00228:279:HFWFVDMXX:1:1229:8477:28792"
:TAGATCGCAAATGGTA:CTTATCAAGCGC:;A00228:279:HFWFVDMXX:1:1229:8477:28792	0	1	629088	1	42M1I48M	*	0	0	GTCCGAACTAGTATCAGGCTTCAAAATCGAATACGCCGCAGGCCCCCTTCGCCCTATTCTTCATAGCAGAATACACAAACATTATTATAAT	,:FF,FFFFFF:,,FF:FFFFFFF,F::FFF,FF,FF:FFFF:FFF,:F:F,FFFFFF,F,FF:FFF,FFF:FFFFFFFFFFFFF:FFFFF	NH:i:4	HI:i:3	AS:i:78	nM:i:3
:TAGATCGCAAATGGTA:CTTATCAAGCGC:;A00228:279:HFWFVDMXX:1:1229:8477:28792	0	1	629088	1	46M1I44M	*	0	0	GTCCGAACTAGTATCAGGCTTCAAAATCGAATACGCCGCAGGCCCCCTTCGCCCTATTCTTCATAGCAGAATACACAAACATTATTATAAT	,:FF,FFFFFF:,,FF:FFFFFFF,F::FFF,FF,FF:FFFF:FFF,:F:F,FFFFFF,F,FF:FFF,FFF:FFFFFFFFFFFFF:FFFFF	NH:i:4	HI:i:4	AS:i:78	nM:i:3

$ samtools view b.bam | grep -F ":TAGATCGCAAATGGTA:CTTATCAAGCGC:;A00228:279:HFWFVDMXX:1:1229:8477:28792"
:TAGATCGCAAATGGTA:CTTATCAAGCGC:;A00228:279:HFWFVDMXX:1:1229:8477:28792	0	1	629088	1	42M1I48M	*	0	0	GTCCGAACTAGTATCAGGCTTCAAAATCGAATACGCCGCAGGCCCCCTTCGCCCTATTCTTCATAGCAGAATACACAAACATTATTATAAT	,:FF,FFFFFF:,,FF:FFFFFFF,F::FFF,FF,FF:FFFF:FFF,:F:F,FFFFFF,F,FF:FFF,FFF:FFFFFFFFFFFFF:FFFFF	NH:i:4	HI:i:3	AS:i:78	nM:i:3
:TAGATCGCAAATGGTA:CTTATCAAGCGC:;A00228:279:HFWFVDMXX:1:1229:8477:28792	0	1	629088	1	46M1I44M	*	0	0	GTCCGAACTAGTATCAGGCTTCAAAATCGAATACGCCGCAGGCCCCCTTCGCCCTATTCTTCATAGCAGAATACACAAACATTATTATAAT	,:FF,FFFFFF:,,FF:FFFFFFF,F::FFF,FF,FF:FFFF:FFF,:F:F,FFFFFF,F,FF:FFF,FFF:FFFFFFFFFFFFF:FFFFF	NH:i:4	HI:i:4	AS:i:78	nM:i:3

Building bamUtil returns error: undefined reference

Hello,
I am working on a HPC cluster that uses linux x86_64. I am encountering an error while building this repository using the following commands:

git clone https://github.com/statgen/bamUtil.git
cd bamUtil
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
module load dev/gcc/8.2
module load dev/PGI-compilers/17.5
make cloneLib
make ## here comes the error

[..]
g++ -std=c++0x -I/usr/local/packages/dev/gcc/8.2.0/include -I../../libStatGen/include -I. -D__ZLIB_AVAILABLE__ -D_FILE_OFFSET_BITS=64 -D__STDC_LIMIT_MACROS -DDATE=""Wed 5 Dec 17:31:32 GMT 2018"" -DVERSION=""1.0.14"" -DUSER=""bo1fra"" -o ../obj/BamExecutable.o -c BamExecutable.cpp -DVERSION=""1.0.14""
mkdir -p ../bin
g++ -std=c++0x -I/usr/local/packages/dev/gcc/8.2.0/include -I../../libStatGen/include -I. -D__ZLIB_AVAILABLE__ -D_FILE_OFFSET_BITS=64 -D__STDC_LIMIT_MACROS -DDATE=""Wed 5 Dec 17:31:32 GMT 2018"" -DVERSION=""1.0.14"" -DUSER=""bo1fra"" -o ../bin/bam ../obj/BamExecutable.o ../obj/Validate.o ../obj/Convert.o ../obj/Diff.o ../obj/DumpHeader.o ../obj/SplitChromosome.o ../obj/WriteRegion.o ../obj/DumpIndex.o ../obj/ReadIndexedBam.o ../obj/DumpRefInfo.o ../obj/Filter.o ../obj/ReadReference.o ../obj/Revert.o ../obj/Squeeze.o ../obj/FindCigars.o ../obj/Stats.o ../obj/PileupElementBaseQCStats.o ../obj/ClipOverlap.o ../obj/MateMapByCoord.o ../obj/SplitBam.o ../obj/TrimBam.o ../obj/MergeBam.o ../obj/PolishBam.o ../obj/GapInfo.o ../obj/Logger.o ../obj/Bam2FastQ.o ../obj/Dedup.o ../obj/Dedup_LowMem.o ../obj/Prediction.o ../obj/LogisticRegression.o ../obj/MathCholesky.o ../obj/HashErrorModel.o ../obj/Recab.o ../obj/OverlapHandler.o ../obj/OverlapClipLowerBaseQual.o ../obj/ExplainFlags.o ../obj/Main.o ../../libStatGen/libStatGen.a -lm -lz
../obj/Validate.o: In function operator<<(std::ostream&, SamValidationErrors const&)': Validate.cpp:(.text._ZlsRSoRK19SamValidationErrors[_ZlsRSoRK19SamValidationErrors]+0x4d): undefined reference to SamValidationErrors::getErrorString(std::__cxx11::basic_string<char, std::char_traits, std::allocator >&) const'
../obj/DumpHeader.o: In function DumpHeader::dumpHeader(char const*)': DumpHeader.cpp:(.text+0x21f): undefined reference to SamFileHeader::getHeaderString(std::__cxx11::basic_string<char, std::char_traits, std::allocator >&) const'
../obj/WriteRegion.o: In function WriteRegion::execute(int, char**)': WriteRegion.cpp:(.text+0xe11): undefined reference to InputFile::readTilChar(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, std::__cxx11::basic_string<char, std::char_traits, std::allocator >&)'
../obj/ReadReference.o: In function ReadReference::execute(int, char**)': ReadReference.cpp:(.text+0x4b3): undefined reference to GenomeSequence::getString(std::__cxx11::basic_string<char, std::char_traits, std::allocator >&, unsigned int, int) const'
../obj/MergeBam.o: In function parseOutRG(SamFileHeader&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, SamFileHeader*, bool)': MergeBam.cpp:(.text+0x28aa): undefined reference to SamHeaderRecord::appendString(std::__cxx11::basic_string<char, std::char_traits, std::allocator >&)'
MergeBam.cpp:(.text+0x28cc): undefined reference to SamHeaderRecord::appendString(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)' MergeBam.cpp:(.text+0x2c18): undefined reference to SamHeaderRecord::appendString(std::__cxx11::basic_string<char, std::char_traits, std::allocator >&)'
MergeBam.cpp:(.text+0x2c3a): undefined reference to SamHeaderRecord::appendString(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)' MergeBam.cpp:(.text+0x2cee): undefined reference to SamHeaderRecord::appendString(std::__cxx11::basic_string<char, std::char_traits, std::allocator >&)'
MergeBam.cpp:(.text+0x2d16): undefined reference to SamFileHeader::appendCommentLines(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)' ../obj/Bam2FastQ.o: In function Bam2FastQ::writeFastQ(SamRecord&, InputFile*, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, char const*)':
Bam2FastQ.cpp:(.text+0x2c0b): undefined reference to SamHeaderRecord::appendString(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)' Bam2FastQ.cpp:(.text+0x2f3f): undefined reference to BaseUtilities::reverseComplement(std::__cxx11::basic_string<char, std::char_traits, std::allocator >&)'
../obj/Dedup.o: In function Dedup::buildReadGroupLibraryMap(SamFileHeader&)': Dedup.cpp:(.text+0x277c): undefined reference to SamHeaderRecord::appendString(std::__cxx11::basic_string<char, std::char_traits, std::allocator >&)'
Dedup.cpp:(.text+0x2868): undefined reference to SamHeaderRecord::appendString(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)' ../obj/Dedup_LowMem.o: In function Dedup_LowMem::buildReadGroupLibraryMap(SamFileHeader&)':
Dedup_LowMem.cpp:(.text+0x25a6): undefined reference to SamHeaderRecord::appendString(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)' Dedup_LowMem.cpp:(.text+0x2692): undefined reference to SamHeaderRecord::appendString(std::__cxx11::basic_string<char, std::char_traits, std::allocator >&)'
collect2: error: ld returned 1 exit status
make[1]: *** [../bin/bam] Error 1
make[1]: Leaving directory `/home/bo1fra/software/bamUtil/src'
make: *** [src] Error 2

Any suggestion?
Thank you in advance

Compilation problems on Linux

BamUtilLibStatGen.1.0.7.tgz

Compiling this on my host (Gentoo Linux) using the toplevel makefile results in a series of errors. I made it compiling by changing the CFLAGS and CXXFLAGS.

(1) The UINT16_MAX and similar variables were not available by default in my compiler (GCC 4.6.3). I had to add -D__STDC_LIMIT_MACROS to the C[XX]FLAGS.

(2) The declaration of BgzfFileType in libStatGen require -D__ZLIB_AVAILABLE__, otherwise one of the files in the bamUtil directory fails.

(3) I had to add the /general and /include directories to the C[XX]FLAGS, because the defaults in the Makefile didn't work already when compiling libStatGen.

(4) Finally, when compiling BamExecutable.cpp somehow the DATE and USER variable were not defined.

BamExecutable.cpp: In static member function 'static void BamExecutable::bamVersion()':
BamExecutable.cpp:41:33: error: 'DATE' was not declared in this scope
BamExecutable.cpp:41:50: error: 'USER' was not declared in this scope

So I had to add -DDATE='"$(date -I)"' -DUSER='"$(whoami)"'"

I hope that helps.

Philip

bam recab: parallel processing

Hello there and thanks for bamUtil.

I have an enhancement request: when using bamUtil for massaging of BAM files one real bottleneck seems to be the recalibration step (bam recab), because it cannot be run on multiple cores. At least conceptually it looks like it should be straightforward to collect the required information per region (running the analysis in parallel for many regions) and then aggregate at the end to give the final table. Applying that table can be a single core process of course.

Are there any plans to implement something along these lines? Or is something similar already implemented somewhere and I missed it in the documentation?

Thanks
Andreas

Indexing the clipped bam files

Hello, I am facing a problem with indexing the clipped bam files using samtools. I am getting errors like bellow. I don't have the same problem when indexing the original bam files (Unclipped). I am not sure if I should ignore this or this is something important and I should try to solve it. Any idea what I should do to solve this problem?

read 'HS24_57:7:2110:1654:80872' mapped to '5' at POS 180043777 to 180043776 has BIN 15670 but should be 1958
In total, there are 9 reads with incorrect BIN fields
Fix this by converting BAM->SAM->BAM to force BIN recalculation

CIGAR does not evaluate to the same length as SEQ

This warning is produced by validate even if the SEQ is set to *. The SAM spec explicitly states that it is ok for the SEQ to be * and that the lengths don't have to match:

If not a ‘*’, the length of the sequence must equal the sum of lengths of M/I/S/=/X operations in CIGAR

It even states that SEQ should be * for secondary alignments:

SEQ and QUAL of secondary alignments should be set to ‘*’ to reduce the file size.

Or am I misunderstanding something?

Thanks!

1.0.13 labeled as the "latest release" under releases

Hi, first of all, thanks for making and maintaining bamUtil and it has been very helpful for my research.

For some reason, on the releases page, 1.0.13 is still labeled as the "Latest release", although 1.0.14 is available and this seems confusing for some people. Could you please update the latest release label to 1.0.14? Thank you!

Error building from source: libStatGen

I'm working on a Linux based HPC working with private data hence the environment is restrictive (no internet connection); therefore packages must be built from source.

After downloading the file and transferring it to the cluster I try to build by specifying the PATH for libStatGen but I get an error:

$ wget https://github.com/statgen/bamUtil/archive/v1.0.14.tar.gz

$ tar -zxvf v1.0.14.tar.gz
$ cd bamUtil-1.0.14
$ make LIB_PATH_GENERAL=/project/M-mtgraovac182840/tools/libStatGen
make[1]: Entering directory '/project/M-mtgraovac182840/tools/bamUtil-1.0.14/src'
make -C /project/M-mtgraovac182840/tools/libStatGen --no-print-directory opt
ar -cru ../libStatGen.a obj/bgzf.o obj/knetfile.o
g++  -O4 -pipe -Wall -Werror  -Wno-strict-overflow -I../include -I.   -D__ZLIB_AVAILABLE__ -D_FILE_OFFSET_BITS=64 -D__STDC_LIMIT_MACROS  -o obj/Chromosome.o -c Chromosome.cpp -DVERSION="\"1.0.0\""
Chromosome.cpp: In constructor 'Chromosome::Chromosome(const string&, unsigned int, bool)':
Chromosome.cpp:42:15: error: '*<unknown>.Chromosome::gs' is used uninitialized in this function [-Werror=uninitialized]
     if (this->gs) delete gs;
         ~~~~~~^~
cc1plus: all warnings being treated as errors
make[3]: *** [../Makefiles/Makefile.common:81: obj/Chromosome.o] Error 1
make[2]: *** [Makefiles/Makefile.base:15: general] Error 2
make[1]: *** [/project/M-mtgraovac182840/tools/libStatGen/Makefiles/Makefile.ext:50: /project/M-mtgraovac182840/tools/libStatGen/libStatGen.a] Error 2
make[1]: Leaving directory '/project/M-mtgraovac182840/tools/bamUtil-1.0.14/src'
make: *** [/project/M-mtgraovac182840/tools/libStatGen/Makefiles/Makefile.base:15: src] Error 2

clipoverlap on raw output straight from bwa

In a more recent update, the primary way we use clipoverlap was broken by --readname enforcing alphanumeric sorting of readnames. BWA's raw SAM output is readname sorted in that readpairs (with of course the same name) are next to each other in the order read from the fastq files but not in alphanumeric sorted order.

The issue is order of operations and how you pipe commands together. As it stands with the latest versions, we have to align, then coordinate sort, then clipoverlap and mark completely clipped reads as unmapped, with then essentially ruins the proper coordinate sorted order. What we'd like to do, and what make the most sense computationally, would be to align, pipe that raw output in to clipoverlap while marking completely clipped reads as unmapped, then coordinate sort and index.

Would it be possible for the --readname option to drop its alphanumeric enforcement and go back to just readpairs must be adjacent to each other? Or add a --bwasam option for streaming straight from BWA?

Thanks!
Ben

Make Error with CentOS

HashErrorModel.cpp: In member function ‘uint8_t HashErrorModel::getQemp(BaseData&)’:
HashErrorModel.cpp:96: error: ‘class std::unordered_map<long unsigned int, HashErrorModel::SMatches, std::hash, std::equal_to, std::allocator<std::pair<const long unsigned int, HashErrorModel::SMatches> > >’ has no member named ‘at’
make[2]: *** [../obj/HashErrorModel.o] Error 1
make[2]: Leaving directory /auto/cmb-01/fs3/lxia/setup/bamUtil_1.0.7/bamUtil/src' make[1]: *** [src] Error 2 make[1]: Leaving directory/auto/cmb-01/fs3/lxia/setup/bamUtil_1.0.7/bamUtil'
make: *** [bamUtil/] Error 2

Looks like my stdlibc++ is old. But I cann't change it c'z I am not root. Could it be fixed from source code side to use more backward compatible expressions?

Thanks

cannot consume -.ubam on a pipe

samtools view ... | bam trimBam -.ubam -.ubam --clip -R

produces this error:

bam: invalid option -- '.'
ERROR: Unrecognized option left

works ok with

samtools view ...  |bam trimBam - - --clip -R

does work though

looks like a getopt problem to me.

building on OSX 10.6.8

did as in README (git)
These files provide some programs for working on SAM/BAM files.--------------------------------------------------------------------------------
To use git to clone the required statgen library:
make cloneLib
Next, to build libStatGen & this program:
make

many lines
......
then
ar -cru ../libStatGen.a obj/GlfException.o obj/GlfFile.o obj/GlfHeader.o obj/GlfRecord.o obj/GlfRefSection.o obj/GlfStatus.o
/opt/local/bin/ranlib: file: ../libStatGen.a(LongHash.o) has no symbols
/opt/local/bin/ranlib: file: ../libStatGen.a(TrimSequence.o) has no symbols
/opt/local/bin/ranlib: file: ../libStatGen.a(WindowsHelper.o) has no symbols
/opt/local/bin/ranlib: file: ../libStatGen.a(SamFlag.o) has no symbols
mkdir -p ../obj
g++ -O4 -pipe -Wall -I../../libStatGen/include -I. -D__ZLIB_AVAILABLE__ -D_FILE_OFFSET_BITS=64 -D__STDC_LIMIT_MACROS -std=c++0x -DDATE=""Sat Aug 25 12:23:40 CEST 2012"" -DVERSION=""1.0.2"" -DUSER=""splaisan"" -o ../obj/BamExecutable.o -c BamExecutable.cpp -DVERSION=""1.0.2""
cc1plus: error: unrecognized command line option "-std=c++0x"
make[1]: *** [../obj/BamExecutable.o] Error 1
make: *** [src] Error 2

Any idea what is missing here?
It worked fine on RHEL but not on my laptop
Thanks

build is broken due to deprecated git protocol

Running make cloneLib is broken with the following error:

Cloning into 'libStatGen'...
fatal: remote error: 
  The unauthenticated git protocol on port 9418 is no longer supported.
Please see https://github.blog/2021-09-01-improving-git-protocol-security-github/ for more information.

According to the linked blog post:

unencrypted git:// offers no integrity or authentication, making it subject to tampering. We expect very few people are still using this protocol, especially given that you can’t push (it’s read-only on GitHub). We’ll be disabling support for this protocol.

Also in the same blog post:

January 11, 2022 | Final brownout.This is the full brownout period where we’ll temporarily stop accepting the deprecated key and signature types, ciphers, and MACs, and the unencrypted Git protocol. This will help clients discover any lingering use of older keys or old URLs.

Which explains why we're seeing the error now.

Odd stats reported for clipOverlap

Dear bamUtils,

I'm using bam clipOverlap --stats --poolSize 100000000 --in PE.bam --out merge.bam and get reports like the below for a handful of files. The alignments all look clean before and after, it's only the stats that don't make too much sense. Is there something I might be missing? This is for ~110bp (after cutadapt, cleaning) PE RNAseq - the imbalance between the fw/rev strands can be attributed to rRNA. The numbers I'm concerned with are the 'Average # Reference Bases Overlapped' and 'Variance of Reference Bases overlapped' as they seem abnormally high.

Overlap Statistics:
Number of overlapping pairs: 31917921
Average # Reference Bases Overlapped: 1178.49
Variance of Reference Bases overlapped: 3.93918e+07
Number of times orientation causes additional clipping: 11373
Number of times the forward strand was clipped: 22035804
Number of times the reverse strand was clipped: 9882117
WARNING: did not find expected overlapping mates for 173173 records.
Completed ClipOverlap Successfully.

Compiling issue

When I run "make" to compile libStatGen and bamUtil, I get the following errors:

In file included from ClipOverlap.cpp:22:
In file included from ./ClipOverlap.h:27:
./MateMapByCoord.h:70:19: error: no type named 'unordered_multimap' in namespace 'std'
     typedef std::unordered_multimap<uint64_t, SamRecord*> MATE_MAP; 
             ~~~~~^
./MateMapByCoord.h:70:37: error: expected member name or ';' after declaration specifiers
     typedef std::unordered_multimap<uint64_t, SamRecord*> MATE_MAP; 
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
./MateMapByCoord.h:73:5: error: unknown type name 'MATE_MAP'
    MATE_MAP myMateBuffer;
    ^
2 warnings and 3 errors generated.
make[1]: *** [../obj/ClipOverlap.o] Error 1
make: *** [src] Error 2

Then, running make install INSTALLDIR=mydirectory fails to install files into the specified directory and yields the same errors as above.

I've tried fresh installing and updating with no change.

Thanks for your help.

bam diff when 2 reads have indentical names AND start positions?

I'm trying to automatically compare BAM files being output by bowtie2 (for a continuous integration system). The input data is identical, but when running on 2 different machines I'm noticing sometimes 2 lines in the BAM files will have the same name/start position, but represent different reads. For whatever reason, bowtie2 sometimes reverses the order of the lines, so bamUtil's algorithm calls a mismatch.

Is there any way to get bam diff to hold off for a few lines to see if there is another record with the same name/start pos that actually matches?

Dedup does not work from std/in

I have a small MiSeq bam file (in.bam) that passes the bamutil validate --in in.bam --so_coord check:

Number of records read = 44491
Number of valid records = 44491

TotalReads	44491.00
MappedReads	44251.00
PairedReads	44491.00
ProperPair	43249.00
DuplicateReads	0.00
QCFailureReads	0.00

MappingRate(%)	99.46
PairedReads(%)	100.00
ProperPair(%)	97.21
DupRate(%)	0.00
QCFailRate(%)	0.00

TotalBases	6645701.00
BasesInMappedReads	6609701.00
Returning: 0 (SUCCESS)

When I run with direct file input bamutil dedup --in in.bam --out del.bam --verbose it works as expected

Writing del.bam
Successfully marked 9 unpaired and 46 paired duplicate reads

When I try piping the SAM, it detects no duplicates samtools view -h in.bam | bamutil dedup --in - --out del.bam --verbose:

Writing del.bam
Successfully marked 0 unpaired and 0 paired duplicate reads

It throws an error when I try to pipe uncompressed bam file samtools view -hub in.bam | bamutil dedup --in -.ubam --out del.bam --verbose:

Sorting the indices of 101 duplicated records
Exiting due to ERROR:
	FAIL_ORDER: Cannot read header since the file pointer is null

My bamutil version:

Set of tools for operating on SAM/BAM files.
Version: 1.0.14; Built: Thu Apr  4 10:14:43 EDT 2019 by slazic

How to cite bamUtil

Is there a publication or preprint for us to cite bamUtil? If not, should we just cite the webpage (https://genome.sph.umich.edu/wiki/BamUtil)? Thanks.

Strelka output using Clipped bams not the same as unclipped bams

Hello,

I noticed that output of Strelka for detecting variants is different when using clipped bams(using ClipOverlap) with using unmodified bams. This is an important issue because I don't know how I can trust the trimmed bams. An specific example is a deletion that is being supported by atleast 4 unpaired reads and this deletion has not been detected with Strelka using the Clipped bams. Do you know what can be the problem here?

new mapqual diff option

Hi,

I would like to introduce a new mapQual Diff option, so that the behaviour is as follows:

~/src/bamUtil/bin/bam diff --mapQualDiff 5 --posDiff 0 --in1 $s1 --in2 $s2 --out out.sam

This would generate an only1 and only2 output files where the reads
that are differently aligned are printed out, including those reads
that are equally aligned but where the mapping quality score is
different by 5 or more points difference. If the difference is only
below 5 points, they will still be in the common part, and not in the
diffs.

I think everything is contained in Diff.cpp, right?

The output of make Clonelib should be clearer

When I run

git clone [email protected]:statgen/bamUtil.git
make cloneLib

I see:

Makefile.inc:17: ../libStatGen/Makefiles/Makefile.tool: No such file or directory
Unable to locate: ./../libStatGen/Makefiles/Makefile.tool
To change the location, set LIB_PATH_GENERAL or LIB_PATH_BAM_UTIL to the appropriate path to libStatGen. Or specify "make LIB_PATH_GENERAL=yourPath" or "make LIB_PATH_BAM_UTIL=yourPath"
Use make cloneLib if you have git and want to clone the current libStatGen at that location.
Cloning into '../libStatGen'...
remote: Counting objects: 4393, done.
remote: Total 4393 (delta 0), reused 0 (delta 0), pack-reused 4393
Receiving objects: 100% (4393/4393), 4.64 MiB | 0 bytes/s, done.
Resolving deltas: 100% (2165/2165), done.
Checking connectivity... done.
Call make to compile libStatGen and this tool.

That looks a lot like an error, and it shouldn't.

FZ tag of type B not supported

The SAM spec details the type "B" array tag, which is not supported here. The error is:
parsing BAM - Unknown custom field of type FZ:B
terminate called after throwing an instance of 'std::runtime_error'
what(): FAIL_PARSE: Unknown tag type.
Aborted

License?

Is there a license? Is this open source, no commercial restrictions?

Cannot Compile NonPrimaryDedup Branch

Compilation log of bamUtil as per "https://github.com/statgen/docker-alignment/blob/master/Dockerfile" linked to by the Functional Equivalence paper "https://www.nature.com/articles/s41467-018-06159-4" is attached

bamUtil-Compile.txt

`bam diff` should exit with a return code

It would be great if bam diff returns an exit code (0 for identical, 1 for different) just like the diff command does. This will allow us to do something like:

$ bam diff --in1 sample1.bam --in2 sample2.bam --all --out cmp
$ test $? -eq 0 && echo "Identical" || echo "Different"

Option to output bamUtil stats to STDOUT rather than to a file

Hi There,

I am using bamUtil stats to analyze some whole-genome sequencing data like so

curl url_for_my_large_bam.bam | ./bam stats --in -.ubam --cBaseQC results.txt

and results.txt is really large. I would love to be able to pipe it to awk to do some filtering. It seems like it would be pretty easy to add an option to write to STDOUT. Is this an option I am missing or would it be possible to add?

Thanks,
Dylan

Makefile should not ignore system LDFLAGS

If the compiler toolchain is in a non-standard location, the current Makefile does pick up the correct system includes from CFLAGS, but does not use the corresponding -L from LDFLAGS.

This produces a broken binary (linked against the wrong version of libstdc++).

Cut middle of reads

Hi - I'd like to use clipoverlap on really bad somatic WGS FFPE data. A lot of my reads overlap completely and nearly all have some overlap. Many have insert size <100bp. (I didn't design the experiment...)

The way I understand it - if I have 2 reads of 100bp each that overlap by 80bp (insert size 120bp) then clipoverlap will give me 1 read of 20bp and 1 of 100bp. Is that right? What I need is 2 reads of 60bp as output. So trim both reads in the middle of the overlap. is that something you would be interested in doing? Just thought I'd check before re-inventing the wheel.

Regards,Liam

Compiling issue

The program ClipOverlap.cpp calls the header file OverlapBaseQualMap.h but the latter does not exist in the bamUtil zip package. Compiling ends up with the following error message:

ClipOverlap.cpp:29:32: error: OverlapBaseQualMap.h: No such file or directory
make[1]: *** [../obj/ClipOverlap.o] Error 1
make: *** [src] Error 2

Commenting line 29 in ClipOverlap.cpp seems to solve the compiling issue.

I assume the OverlapBaseQualMap.h is not needed anyway since it is not used in ClipOverlap.cpp. Could you please confirm?

Cheers

Bastien

make error

make[1]: Entering directory /home/songsy/softwares/bamUtil/src' make -C ../../libStatGen --no-print-directory opt ar -cru ../libStatGen.a obj/bgzf.o obj/knetfile.o g++ -O4 -pipe -Wall -Werror -Wno-strict-overflow -I../include -I. -D__ZLIB_AVAILABLE__ -D_FILE_OFFSET_BITS=64 -D__STDC_LIMIT_MACROS -o obj/GzipHeader.o -c GzipHeader.cpp -DVERSION="\"1.0.0\"" cc1plus: warnings being treated as errors GzipHeader.cpp: In member function ‘bool GzipHeader::readHeader(UncompressedFileType&)’: GzipHeader.cpp:83: error: comparison between signed and unsigned integer expressions make[3]: *** [obj/GzipHeader.o] Error 1 make[2]: *** [general] Error 2 make[1]: *** [../../libStatGen/libStatGen.a] Error 2 make[1]: Leaving directory/home/songsy/softwares/bamUtil/src'
make: *** [src] Error 2

I have no idea what's the problem! Thanks!!

Compile error

When I run "make" to compile libStatGen and bamUtil, I get the following errors:
make[1]: *** [obj/VcfRecord.o] Error 1
make[1]: Leaving directory `/home/Software/libStatGen/vcf'
make: *** [vcf] Error 2

Thank for your help.

Compilation doesn't work (warnings=errors)

I tried to compile bamutil (version with libgen included -
https://genome.sph.umich.edu/w/images/7/70/BamUtilLibStatGen.1.0.13.tgz) on linux according to manual. However it didn't work because all warnings was treated as errors. I had to eddit a makefile to fix that issue:

cd bamUtil_1.0.13 \
&& sed -i '/-Werror/d' libStatGen/general/Makefile

and after that bamUtil compiled without any error. That fix was not obvious for me at all, and should be either mentioned in manual or the Makefile should be eddited

can not install with conda

Hi, I installed this tool but it's empty with conda, can you give me some suggestions?

bam2FastQ generates empty fastq files

Hi, i recently use bamUtil to extract reads to be paried-read fastq files from bam file. They found read pairs but the generated fastq files are empty.
bam bam2FastQ --in in_file.bam --firstOut in_file_r1.fastq --secondOut in_file_r2.fastq
When it is done running, it says:

Found 80109518 read pairs.
Found 0 unpaired reads.

but the two fastq files are empty with size as 0.. Could you please help me out on it? Thank you