bigly: a pileup library that embraces the huge

bigly is an API and a command-line app (binaries here). It is similar to samtools mpileup but it reports a huge number of additional variables that are useful for structural variant calling and visualization.

For each requested position, the struct below is filled by the appropriate position in any overlapping alignment that meets the requested filters:

// Pile holds the information about a single base.
type Pile struct {
	Chrom                  string
	Pos                    int
	Depth                  int    // count of reads passing filters.
	RefBase                byte   // Reference base a this position.
	MisMatches             uint32 // number of mismatches .
	ProperPairs            int    // count of reads with paired flag
	SoftStarts             uint32 // counts of base preceding an 'S' cigar op
	SoftEnds               uint32 // ...  following ...
	HardStarts             uint32 // counts of base preceding an 'H' cigar op
	HardEnds               uint32
	InsertionStarts        uint32 // counts of base preceding an 'I' cigar op
	InsertionEnds          uint32
	Deletions              uint32  // counts of deletions 'D' at this base
	Heads                  uint32  // counts of starts of reads at this base
	Tails                  uint32  // counts of ends of reads at this base
	Splitters              uint32  // count of non-secondary reads with SA tags.
	Splitters1             uint32  // count of non-secondary reads with exactly 1 SA tag.
	Bases                  []byte  // All bases from reads covering this position
	Quals                  []uint8 // All quals from reads covering this position
	MeanInsertSizeLP       uint32  // Calculated with left-most of pair
	MeanInsertSizeRM       uint32  // Calculated with right-most of pair
	OrientationPlusPlus    uint32  // Paired reads mapped in +/+ orientation
	OrientationMinusMinus  uint32  // Paired reads mapped in -/- orientation
	OrientationMinusPlus   uint32  // Paired reads mapped in -/+ orientation
    OrientationSplitter    uint32  // Count of +/- or -/+ splitters.
	Discordant             uint32  // Number of reads with insert size > ConcordantCutoff
	DiscordantChrom        uint32  // Number of reads mapping on different chroms
	DiscordantChromEntropy float32 // high value means all discordants came from same chrom.
	GC65                   uint32
	GC257                  uint32
	Duplicity65            float32 // measure of lack of sequence entropy.
	Duplicity257           float32 // measure of lack of sequence entropy.
	SplitterPositions      []int
	SplitterStrings        []string
}

The program in cmd/bigly/main.go is distributed as an example program of what one can do with this library--namely make an enhanced pileup in a few lines of code.

Usage

At this time, the usage of the example program is very simple. Default exclude flags are (sam.Unmapped | sam.QCFail | sam.Duplicate)

bigly $bam $chrom:$start-$end > o

if a reference is specified with -r it will report statistics about GC content in windows surrounding each base.

help:

bigly 0.2.0
usage: bigly [--minbasequality MINBASEQUALITY] [--minmappingquality MINMAPPINGQUALITY] [--excludeflag EXCLUDEFLAG] [--includeflag INCLUDEFLAG] [--mincliplength MINCLIPLENGTH] [--includebases] [--splitterverbosity SPLITTERVERBOSITY] [--reference REFERENCE] BAMPATH REGION

positional arguments:
  bampath
  region

options:
  --minbasequality MINBASEQUALITY, -q MINBASEQUALITY
                         base quality threshold [default: 10]
  --minmappingquality MINMAPPINGQUALITY, -Q MINMAPPINGQUALITY
                         mapping quality threshold [default: 5]
  --excludeflag EXCLUDEFLAG, -F EXCLUDEFLAG
  --includeflag INCLUDEFLAG, -f INCLUDEFLAG
  --mincliplength MINCLIPLENGTH, -c MINCLIPLENGTH
                         only count H/S clips of at least this length [default: 15]
  --includebases, -b     output each base and base quality score
  --splitterverbosity SPLITTERVERBOSITY, -s SPLITTERVERBOSITY
                         0-only count; 1:count and single most frequent; 2:all SAs; 3:dont shorten positions
  --reference REFERENCE, -r REFERENCE
                         optional path to reference fasta.
  --help, -h             display this help and exit
  --version              display version and exit

API

GoDoc Documentation is here: ![GoDoc] (https://godoc.org/github.com/brentp/bigly?status.png)

With a supporting library easing regional bam queries here: ![GoDoc] (https://godoc.org/github.com/brentp/bigly/bamat?status.png)

View the tests for more examples. bigly uses biogo library for bam access.

Plotting

python scripts/plotter.py bigly.output

will make a plot with python+matplotlib on output from bigly.

An example image looks like:

This is a homozygous deletion easily seen from the depth, but we can see that the deletion is nicely delineated by soft-clips and we see aberrant insert sizes bounding the deletion. In most libraries, we would also see splitters flanking the region.

TODO

Track strand of bases.
Report the 5th and 95th percentile of insert size.
More efficient insert-size calc

Credits

Ryan Layer's svv was the inspiration for the plotter.

Obviously, samtools is the reference pileup implementation.

Did I use bamat or biogo correctly?

Hi Brent,

I was trying to fetch records from a bam file with golang and bamat seems to be the most straight-forward solution I have ever seen. But when I did some tests, biogo/hts or bamat seemed to give unexpected results. Did I use biogo/hts or bamat correctly?

Here is my code:

First I tested with samtools:

# this command gives only one record
# samtools/htslib version: 1.11
./samtools view ./test.bam  chr1:100000-100200

Then I tried biogo/hts & bamat

import (
	"bufio"
	"github.com/biogo/hts/bam"
	"github.com/biogo/hts/sam"
	"os"
)

/*
The entire bamat source codes are copied here
*/

// this program gives 2000+ records
// the position of the first record starts from the beginning of bam
func main() {

	bamat, _ := New("test.bam")
	defer bamat.Close()

	iter, _ := bamat.Query("chr1", 100000, 100200)
	for iter.Next() {
		rec := iter.Record()
		println(rec.Name)
	}
}

Additional information:
system: centos7

test.bam is the alignment result of a paired-end sequencing experient.
test.bam is sorted by chromosome and position
test.bam.bai is built with samtools index command

I also printed the information of offsets and chunks:

// Query the BamAt with 0-base half-open interval.
func (b *BamAt) Query(chrom string, start int, end int) (*bam.Iterator, error) {
	if chrom == "" { // stdin
		return bam.NewIterator(b.Reader, nil)
	}
	ref := b.Refs[chrom]
	if end <= 0 {
		end = ref.Len() - 1
	}
	chunks, err := b.idx.Chunks(ref, start, end)
	for _, c := range chunks {
		fmt.Printf("Chunk, c.Begin.File: %v, c.Begin.Block: %v \n", c.Begin.File, c.Begin.Block)
		fmt.Printf("Chunk, c.End.File: %v, c.End.Block: %v \n", c.End.File, c.End.Block)
	}
	if err != nil {
		return nil, err
	}
	return bam.NewIterator(b.Reader, chunks)
}

output:

Chunk, c.Begin.File: 1157, c.Begin.Block: 0
Chunk, c.End.File: 152443, c.End.Block: 61430
Chunk, c.Begin.File: 305256, c.Begin.Block: 45799
Chunk, c.End.File: 369546, c.End.Block: 49973
Chunk, c.Begin.File: 4495641, c.Begin.Block: 6375
Chunk, c.End.File: 4495641, c.End.Block: 7201
Chunk, c.Begin.File: 4678143, c.Begin.Block: 22248
Chunk, c.End.File: 4678143, c.End.Block: 22521
Chunk, c.Begin.File: 5535256, c.Begin.Block: 32237
Chunk, c.End.File: 5535256, c.End.Block: 33167
Chunk, c.Begin.File: 8076117, c.Begin.Block: 22784
Chunk, c.End.File: 8076117, c.End.Block: 24440
Chunk, c.Begin.File: 8319574, c.Begin.Block: 22863
Chunk, c.End.File: 8319574, c.End.Block: 24508
Chunk, c.Begin.File: 54542702, c.Begin.Block: 55586
Chunk, c.End.File: 54542702, c.End.Block: 62429
Chunk, c.Begin.File: 59486419, c.Begin.Block: 10427
Chunk, c.End.File: 59486419, c.End.Block: 10975

This is my python code which gives the correct record:

import pysam

def query_bam(region, bamfile):

    samfile = pysam.AlignmentFile(bamfile, "rb")
    iterable = samfile.fetch(region=region)

    record_count = 0
    for record in iterable:
        print(record.query_name)
        record_count += 1

    samfile.close()
    return record_count

if __name__ == "__main__":
    bam_file = "test.bam"
    region = "chr1:100000-100200"
    total_records = query_bam(region, bam_file)
    print(f"Total records found: {total_records}")

brentp / bigly Goto Github PK

bigly's Introduction

bigly: a pileup library that embraces the huge

Usage

API

Plotting

TODO

Credits

bigly's People

Contributors

Stargazers

Watchers

Forkers

bigly's Issues

Recommend Projects

Recommend Topics

Recommend Org