biojulia / xam.jl Goto Github PK

Parse and process SAM and BAM formatted files

License: MIT License

Julia 100.00%

bioinformatics sam-format bam-format sam bam sam-files bam-files biojulia

xam.jl's Issues

Error when opening a BAM file generated by HISAT2 version 2.2.0

I have used XAM.jl to treat BAM files. Today I came across an error when opening a BAM file generated by HISAT2 version 2.2.0. No error occurred for a BAM file generated by HISAT2 version 2.1.0 (from the same FASTQ file)

Expected Behavior

No errors are expected.

julia> using XAM
julia> reader = open(BAM.Reader, "SRR8452726_1.sort.bam")

Current Behavior

julia> using XAM
julia> reader = open(BAM.Reader, "SRR8452726_1.sort.bam")
ERROR: ArgumentError: malformed metainfo at line 68
Stacktrace:
 [1] readheader!(::TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,Base.GenericIOBuffer{Array{UInt8,1}}}, ::XAM.SAM.Header, ::Tuple{Int64,Int64}) at /Users/ozakiharuka/.julia/packages/XAM/ahh4D/src/sam/readrecord.jl:318
 [2] Reader at /Users/ozakiharuka/.julia/packages/XAM/ahh4D/src/sam/reader.jl:13 [inlined]
 [3] Reader at /Users/ozakiharuka/.julia/packages/XAM/ahh4D/src/sam/reader.jl:38 [inlined]
 [4] init_bam_reader(::BGZFStreams.BGZFStream{IOStream}) at /Users/ozakiharuka/.julia/packages/XAM/ahh4D/src/bam/reader.jl:96
 [5] init_bam_reader at /Users/ozakiharuka/.julia/packages/XAM/ahh4D/src/bam/reader.jl:125 [inlined]
 [6] #Reader#2 at /Users/ozakiharuka/.julia/packages/XAM/ahh4D/src/bam/reader.jl:36 [inlined]
 [7] Reader at /Users/ozakiharuka/.julia/packages/XAM/ahh4D/src/bam/reader.jl:31 [inlined]
 [8] #open#1 at /Users/ozakiharuka/.julia/packages/BioGenerics/cCuGr/src/IO.jl:42 [inlined]
 [9] open(::Type{XAM.BAM.Reader}, ::String) at /Users/ozakiharuka/.julia/packages/BioGenerics/cCuGr/src/IO.jl:42
 [10] top-level scope at REPL[2]:1
(v1.3) pkg> status
....
  [d759349c] XAM v0.2.3
....
julia> versioninfo()
Julia Version 1.3.1
Commit 2d5741174c (2019-12-30 21:36 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.6.0)
  CPU: Intel(R) Core(TM) i5-9600K CPU @ 3.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

Possible Solution / Implementation

I have no idea but the release note of HISAT2 might help.

HISAT 2.2.0 release 2/6/2020
This major version update includes a new feature to handle “repeat” reads. Based on sets of 100-bp simulated and 101-bp real reads that we tested, we found that 2.6-3.4% and 1.4-1.8% of the reads were mapped to >5 locations and >100 locations, respectively. Attempting to report all alignments would likely consume a prohibitive amount of disk space. In order to address this issue, our repeat indexing and alignment approach directly aligns reads to repeat sequences, resulting in one repeat alignment per read. HISAT2 provides application programming interfaces (API) for C++, Python, and JAVA that rapidly retrieve genomic locations from repeat alignments for use in downstream analyses.
Other minor bug fixes are also included as follows:

Fixed occasional sign (+ or -) issues of template lengths in SAM file
Fixed duplicate read alignments in SAM file
Skip a splice site if exon’s last base or first base is ambiguous (N)

Steps to Reproduce (for bugs)

Download the BAM file from here.
Run the following codes:

julia> using XAM
julia> reader = open(BAM.Reader, "SRR8452726_1.sort.bam")

Context

Your Environment

Julia Version 1.3.1
XAM v0.2.3
OS: macOS (x86_64-apple-darwin18.6.0)
CPU: Intel(R) Core(TM) i5-9600K CPU @ 3.70GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

(v1.3) pkg> status
    Status `~/.julia/environments/v1.3/Project.toml`
  [c7e460c6] ArgParse v1.1.0
  [8e4a8c10] BED v0.1.0
  [00701ae9] BioAlignments v2.0.0
  [7e6ae17a] BioSequences v2.0.5
  [336ed68f] CSV v0.6.1
  [9961bab8] Cbc v0.6.7
  [aaaa29a8] Clustering v0.14.0
  [8f4d0f93] Conda v1.4.1
  [a93c6f00] DataFrames v0.20.2
  [968ba79b] DocOpt v0.4.0
  [8f5d6c58] EzXML v1.1.0
  [c2308a5c] FASTX v1.1.2
  [652a1917] Fire v0.1.0
  [587475ba] Flux v0.11.0
  [899a7d2d] GenomicFeatures v2.0.0
  [c27321d9] Glob v1.3.0
  [a2cc645c] GraphPlot v0.4.2
  [cd3eb016] HTTP v0.8.16
  [7073ff75] IJulia v1.21.2
  [682c06a0] JSON v0.21.0
  [4076af6c] JuMP v0.21.2
  [093fc24a] LightGraphs v1.3.1
  [9b87118b] PackageCompiler v1.1.1
  [91a5bcdd] Plots v1.0.10
  [d330b81b] PyPlot v2.9.0
  [777a009e] ReadCoverage v0.1.1 #master (https://github.com/bioinfo-tsukuba/ReadCoverage.jl)
  [3cdcf5f2] RecipesBase v1.0.0
  [01d81517] RecipesPipeline v0.1.3
  [2913bbd2] StatsBase v0.33.0
  [70df011a] TableReader v0.4.0
  [9c690861] TensorToolbox v1.0.1
  [d759349c] XAM v0.2.3
  [ddb6d928] YAML v0.4.0
  [8bb1440f] DelimitedFiles
  [de0858da] Printf

Various spec violations

Dear @CiaranOMara

I've been working a bit with XAM.jl. Great package, but I've found quite a number of small errors and mistakes I think should be fixed. Making one PR for each would be overwhelming, especially since fixing some of them has implications that are not straightforward. So I'd like to discuss it a bit here:

List of errors/inconsistencies

Missing qualities are set to 0xff, but hasquality does not check this, so the result may be wrong
refname should be accessible even if the read is unmapped (per specs section 1.4.3)
isnextmapped checks if the record is filled, but ismapped doesn't
hasauxdata doesn't even check if the record has aux data
quality(record) raises an obscure error if a quality is > 127
function MetaInfo should check that a tag has two characters per the specs
ismapped, isprimary and ispositive strand errors when called on an unfilled record, BUT isnextmapped doesn't
hasrefid, hasrefname and hasposition just returns if the record is filled, not what is actually asked. But hasrightposition behaves differently. Worse, refname behaves differently too, meaning you could have hasrefname() == true, yet it still errors when the refname is fetched.

Proposals/enhancements

We should just get rid of unfilled records, they make little sense as a concept. Instead, we initialize every record as an empty, but valid record. This way, we never need to check if the record is filled, since it always is.
Then we can remove e.g. hasflag since every record always have flags
all hasX function should return a Bool
if hasX function returns false, the accessor (X) function should error.
Rewrite the memory layout of the BAM record so it exactly matches the specifications. This will not result in any performance issues, just make the underlying code simpler and make it easier to interface with C in the future

I've implemented many of these changes at https://github.com/jakobnissen/XAM.jl, but that "fork" is stale, and I'd rather merge the fixes into the original XAM.jl

Memory leak

I've observed a memory leak with the in-place reading/iteration pattern.

using XAM
using ProgressMeter

io = open(file_sam, "r")
reader = XAM.SAM.Reader(io)

p = Progress(filesize(file_sam))

record = XAM.SAM.Record()
while !eof(reader)
    read!(reader, record)
    #...
    ProgressMeter.update!(p, position(reader.state.stream),showvalues = [(:position,position(reader.state.stream))])
end

close(io)

Inconsistent BAM flag accessor functions

Currently, it's possible to create a spec-compliant BAM record with the following behaviour:

julia> BAM.hasrefname(record)
true

julia> BAM.refname(record)
ERROR: ArgumentError: record is not mapped
Stacktrace:
 [1] checked_refid at /home/jakni/.julia/packages/BioAlignments/AGQde/src/bam/record.jl:162 [inlined]
 [2] refname(::BioAlignments.BAM.Record) at /home/jakni/.julia/packages/BioAlignments/AGQde/src/bam/record.jl:178
 [3] top-level scope at none:0

(on a related note, the function BAM.hasnextrefname simply doesn't work.)

What's going on?

Well, for a subset (which subset, I'm not quite sure) of the flags, there is a corresponding flag accessor isX function and a function hasX. I'm not entirely sure what the purpose of the hasX function are - some of them seem to be exported to BioCore. It might be I'm simply confused as to what the purpose of these functions are, if so, please tell me.
Anyway, the behaviour of these functions are inconsistent and highly unintuitive. For example, ismapped, isprimary and ispositivestrand all raise an error if the record is unfilled, else return a boolean. However isnextmapped returns false with an unfilled record.

Furthermore, hasrefid, hasrefname, hasnextrefid returns whether the record is filled, NOT whether the record is aligned (i.e. actually has a refid etc.). So a valid, unmapped record can have an ID of -1 and still return true when asked if hasrefid. The similar function hasnextrefname, however, actually does check whether it has a valid next reference name.

Also, refname and nextrefname checks for the validity of the corresponding IDs before fetching the name, whereas hasrefname does not (hence the error at the beginning of this issue).

hasposition returns true if the Record is unmapped (i.e. has an invalid position), but hasrightposition returns false in that case.

hasflag seem to me to be meaningless. All BAM records by definition has flags. Currently, it simply returns isfilled(record), and is not used in any other code.

This seems very unsystematic. I propose changing all these functions to follow the following rules:

If an existing function hasX returns true, then isX will always return a boolean which is guaranteed to be correct.
If an existing function hasX returns false, then isX will always throw an error
The functions hasrefid, hasrefname, hasnextrefid and hasnextrefname will return false if the record is unfilled, or if the record/mate is unmapped, else true
The functions hasposition, hasnextposition and hasrightposition returns false is the record is unfilled or the record/mate is unmapped, else true

Add eachoverlap(reader::Reader, refname::String)

Hi,

Would it be possible to add eachoverlap such that it can be browsed by chromosome when the BAM is indexed ?

Expected Behavior

chrom = "chr1"
itr = eachoverlap(reader, chrom)

Current Behavior

chrom = "chr1"
idx = findfirst(isequal(chrom), reader.refseqnames)
refseqlen = reader.refseqlens[idx]

itr = eachoverlap(reader, chrom, 1:refseqlen))

Thanks,

Switch to Oneflow

I would contend that Oneflow is a better branching model than Git Flow for Julia packages. BioAlignments made the transition (albeit, a rocky one). What do others think?

Add `index!(reader::Reader, path)`

See BioJulia/BioGenerics.jl#14 (comment).
See BioJulia/FASTX.jl@c66d035 for an implementation example.

how to use @threads to use multithreads when reading BAM file

how to use @threads to read BAM file more than one core?

Browsing the documentation

How do you browse the documentation? Is it possible to simply include a link in the Readme that links with the documentation markdown files?

Converge SAM and BAM records

Make SAM and BAM records display similarly
A SAM and BAM record contain basically the same information. So they should be displayed similarly in the REPL.
Create SAM-to-BAM conversion functions
This is a little harder, but probably worthwhile.

Install ERROR: Unsatisfiable requirements detected for package BioSequences

Today I faced a problem on installing XAM.jl:

(v1.6) pkg> add XAM
   Resolving package versions...
ERROR: Unsatisfiable requirements detected for package BioSequences [7e6ae17a]:
 BioSequences [7e6ae17a] log:
 ├─possible versions are: 1.0.0-2.0.5 or uninstalled
 ├─restricted to versions * by an explicit requirement, leaving only versions 1.0.0-2.0.5
 ├─restricted by compatibility requirements with XAM [d759349c] to versions: 2.0.0-2.0.5
 │ └─XAM [d759349c] log:
 │   ├─possible versions are: 0.1.0-0.2.7 or uninstalled
 │   └─restricted to versions * by an explicit requirement, leaving only versions 0.1.0-0.2.7
 └─restricted by compatibility requirements with Bio [3637df68] to versions: 1.0.0-1.1.0 — no versions left                                                                                                  
   └─Bio [3637df68] log:
     ├─possible versions are: 1.0.0-1.0.1 or uninstalled
     └─restricted to versions * by an explicit requirement, leaving only versions 1.0.0-1.0.1

It is strange since I have already installed BioSequences.jl:

(v1.6) pkg> status BioSequences
      Status `/data/projects/Project_JIE/programs/CONFIG_FILES_AT_HOME_DIR/dot_julia/environments/v1.6/Project.toml`                                                                                         
  [7e6ae17a] BioSequences v1.1.0

I'm on a Linux server. The Julia info is:

julia> versioninfo()
Julia Version 1.6.1
Commit d5c1cfd* (2021-05-06 13:40 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E7-4880 v2 @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, ivybridge)

Does anyone know the reason? Thanks.

Correct way to add records to an array

I'm trying to read a BAM file and add only mapped records to an array. But irrespective of the BAM file used the return records array is filled with the same record (the record changes when I change the file but with same file its all filled same record).

function read_bam(bam_path)
    rec = []
    reader = open(BAM.Reader, bam_path)
    record = BAM.Record()
    while !eof(reader)
        empty!(record)
        read!(reader, record)
        if BAM.ismapped(record)
            push!(rec, record)
        end
    end
    return rec
end

Am I doing something wrong here? Is there a way to achieve this?

Your Environment

Package Version used: 0.3.1
Julia Version used: 1.7.2
Operating System and version (desktop or mobile): Ubuntu 20.04

BoundsError in eachoverlap()

Expected Behavior

I tried to count the number of BAM.Record objects overlapped with a specified genomic region by using eachoverlap(bam_reader, chrom, leftpos:rightpos)

julia> using GenomicFeatures

julia> using XAM

julia> using BioAlignments

julia> function test_bam_eachoverlap(path_bam, chrom, leftpos, rightpos)
	bam_reader = open(BAM.Reader, path_bam, index = path_bam * ".bai")
	count = 0
	for record in eachoverlap(bam_reader, chrom, leftpos:rightpos)
		count += 1
	end
	close(bam_reader)
	return count
end

This BAM file (SRR7993829_1.100K.forward.bam) has one record for JH584304.1:

$ samtools view SRR7993829_1.100K.forward.bam JH584304.1
SRR7993829.4835529.1    272     JH584304.1      51715   0       15M559N60M      *       0       0       CAAGCAGAACCTTTGGTCATGCCAGCGACAAGCCAACAGCCTGCGTAGGTGGCTGGAGCTAAGCCAGGAGGAGCG       JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFAA     AS:i:-5 ZS:i:-5 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:74C0       YT:Z:UU XS:A:+  NH:i:2

So, the expected behavior is as follows:

julia> test_bam_eachoverlap("SRR7993829_1.100K.forward.bam", "JH584304.1", 51000, 51200)
0

julia> test_bam_eachoverlap("SRR7993829_1.100K.forward.bam", "JH584304.1", 51000, 51715)
1

Current Behavior

I got an BoundsError:

julia> test_bam_eachoverlap("SRR7993829_1.100K.forward.bam", "JH584304.1", 51000, 51200)
0

julia> test_bam_eachoverlap("SRR7993829_1.100K.forward.bam", "JH584304.1", 51000, 51715)
ERROR: BoundsError: attempt to access 1-element Array{BGZFStreams.Block,1} at index [2]
Stacktrace:
 [1] getindex at ./array.jl:744 [inlined]
 [2] virtualoffset(::BGZFStreams.BGZFStream{IOStream}) at /Users/haruka/.julia/packages/BGZFStreams/qApsr/src/bgzfstream.jl:156
 [3] iterate(::XAM.BAM.OverlapIterator{IOStream}, ::XAM.BAM.OverlapIteratorState) at /Users/haruka/.julia/packages/XAM/9nk3g/src/bam/overlap.jl:65
 [4] test_bam_eachoverlap(::String, ::String, ::Int64, ::Int64) at ./REPL[83]:5
 [5] top-level scope at REPL[85]:1

julia> test_bam_eachoverlap("SRR7993829_1.100K.forward.bam", "JH584304.1", 51715, 51716)
ERROR: BoundsError: attempt to access 1-element Array{BGZFStreams.Block,1} at index [2]
Stacktrace:
 [1] getindex at ./array.jl:744 [inlined]
 [2] virtualoffset(::BGZFStreams.BGZFStream{IOStream}) at /Users/haruka/.julia/packages/BGZFStreams/qApsr/src/bgzfstream.jl:156
 [3] iterate(::XAM.BAM.OverlapIterator{IOStream}, ::XAM.BAM.OverlapIteratorState) at /Users/haruka/.julia/packages/XAM/9nk3g/src/bam/overlap.jl:65
 [4] test_bam_eachoverlap(::String, ::String, ::Int64, ::Int64) at ./REPL[83]:5
 [5] top-level scope at REPL[86]:1

Possible Solution / Implementation

I'm not sure how to fix the error, but I think that the error is relevant to the fact that the last record in this BAM file (SRR7993829_1.100K.forward.bam) is on JH584304.1:

$ samtools idxstats hisat2/SRR7993829_1.100K.forward.bam
chr1	195471971	1565	0
chr2	182113224	2116	0
chr3	160039680	1599	0
chr4	156508116	1789	0
chr5	151834684	1699	0
chr6	149736546	1486	0
chr7	145441459	1626	0
chr8	129401213	1393	0
chr9	124595110	1659	0
chr10	130694993	1230	0
chr11	122082543	2123	0
chr12	120129022	1580	0
chr13	120421639	1018	0
chr14	124902244	937	0
chr15	104043685	913	0
chr16	98207768	7585	0
chr17	94987271	2645	0
chr18	90702639	946	0
chr19	61431566	783	0
chrX	171031299	921	0
chrY	91744698	51	0
chrM	16299	730	0
GL456210.1	169725	0	0
GL456211.1	241735	0	0
GL456212.1	153618	0	0
GL456213.1	39340	0	0
GL456216.1	66673	0	0
GL456219.1	175968	0	0
GL456221.1	206961	0	0
GL456233.1	336933	0	0
GL456239.1	40056	0	0
GL456350.1	227966	2	0
GL456354.1	195993	0	0
GL456359.1	22974	0	0
GL456360.1	31704	0	0
GL456366.1	47073	0	0
GL456367.1	42057	0	0
GL456368.1	20208	0	0
GL456370.1	26764	0	0
GL456372.1	28664	0	0
GL456378.1	31602	0	0
GL456379.1	72385	0	0
GL456381.1	25871	0	0
GL456382.1	23158	0	0
GL456383.1	38659	0	0
GL456385.1	35240	0	0
GL456387.1	24685	0	0
GL456389.1	28772	0	0
GL456390.1	24668	0	0
GL456392.1	23629	0	0
GL456393.1	55711	0	0
GL456394.1	24323	0	0
GL456396.1	21240	0	0
JH584292.1	14945	0	0
JH584293.1	207968	1	0
JH584294.1	191905	6	0
JH584295.1	1976	0	0
JH584296.1	199368	0	0
JH584297.1	205776	0	0
JH584298.1	184189	1	0
JH584299.1	953012	0	0
JH584300.1	182347	0	0
JH584301.1	259875	0	0
JH584302.1	155838	0	0
JH584303.1	158099	0	0
JH584304.1	114452	1	0
ERCC-00002	1061	0	0
ERCC-00003	1023	0	0
ERCC-00004	523	0	0
ERCC-00009	984	0	0
ERCC-00012	994	0	0
ERCC-00013	808	0	0
ERCC-00014	1957	0	0
ERCC-00016	844	0	0
ERCC-00017	1136	0	0
ERCC-00019	644	0	0
ERCC-00022	751	0	0
ERCC-00024	536	0	0
ERCC-00025	1994	0	0
ERCC-00028	1130	0	0
ERCC-00031	1138	0	0
ERCC-00033	2022	0	0
ERCC-00034	1019	0	0
ERCC-00035	1130	0	0
ERCC-00039	740	0	0
ERCC-00040	744	0	0
ERCC-00041	1122	0	0
ERCC-00042	1023	0	0
ERCC-00043	1023	0	0
ERCC-00044	1156	0	0
ERCC-00046	522	0	0
ERCC-00048	992	0	0
ERCC-00051	274	0	0
ERCC-00053	1023	0	0
ERCC-00054	274	0	0
ERCC-00057	1021	0	0
ERCC-00058	1136	0	0
ERCC-00059	525	0	0
ERCC-00060	523	0	0
ERCC-00061	1136	0	0
ERCC-00062	1023	0	0
ERCC-00067	644	0	0
ERCC-00069	1137	0	0
ERCC-00071	642	0	0
ERCC-00073	603	0	0
ERCC-00074	522	0	0
ERCC-00075	1023	0	0
ERCC-00076	642	0	0
ERCC-00077	273	0	0
ERCC-00078	993	0	0
ERCC-00079	644	0	0
ERCC-00081	534	0	0
ERCC-00083	1022	0	0
ERCC-00084	994	0	0
ERCC-00085	844	0	0
ERCC-00086	1020	0	0
ERCC-00092	1124	0	0
ERCC-00095	521	0	0
ERCC-00096	1107	0	0
ERCC-00097	523	0	0
ERCC-00098	1143	0	0
ERCC-00099	1350	0	0
ERCC-00104	2022	0	0
ERCC-00108	1022	0	0
ERCC-00109	536	0	0
ERCC-00111	994	0	0
ERCC-00112	1136	0	0
ERCC-00113	840	0	0
ERCC-00116	1991	0	0
ERCC-00117	1136	0	0
ERCC-00120	536	0	0
ERCC-00123	1022	0	0
ERCC-00126	1118	0	0
ERCC-00130	1059	0	0
ERCC-00131	771	0	0
ERCC-00134	274	0	0
ERCC-00136	1033	0	0
ERCC-00137	537	0	0
ERCC-00138	1024	0	0
ERCC-00142	493	0	0
ERCC-00143	784	0	0
ERCC-00144	538	0	0
ERCC-00145	1042	0	0
ERCC-00147	1023	0	0
ERCC-00148	494	0	0
ERCC-00150	743	0	0
ERCC-00154	537	0	0
ERCC-00156	494	0	0
ERCC-00157	1019	0	0
ERCC-00158	1027	0	0
ERCC-00160	743	0	0
ERCC-00162	523	0	0
ERCC-00163	543	0	0
ERCC-00164	1022	0	0
ERCC-00165	872	0	0
ERCC-00168	1024	0	0
ERCC-00170	1023	0	0
ERCC-00171	505	0	0
*	0	0	0

Steps to Reproduce (for bugs)

You can donwload the BAM file from https://www.dropbox.com/sh/8kenlaq25wenmvp/AAC622OEyG4RsuvriUNInfLMa?dl=0
The code is below:

using GenomicFeatures
using XAM
using BioAlignments

function test_bam_eachoverlap(path_bam, chrom, leftpos, rightpos)
	bam_reader = open(BAM.Reader, path_bam, index = path_bam * ".bai")
	count = 0
	for record in eachoverlap(bam_reader, chrom, leftpos:rightpos)
		count += 1
	end
	close(bam_reader)
	return count
end

test_bam_eachoverlap("SRR7993829_1.100K.forward.bam", "JH584304.1", 51000, 51715)

Your Environment

Package Version used: BioAlignments v2.0.0, GenomicFeatures v2.0.3, XAM v0.2.6
Julia Version used: 1.3.0 and 1.5.1
Operating System and version (desktop or mobile): macOS 10.13.6

Performance of BAM.quality ?

I posted this some time ago and someone mentioned that the performance of BAM.quality wasn't optimal. I also see in some of my code that it takes a significant portion of time.

Is there a reason for that reinterpret and conversion business ?

https://github.com/BioJulia/BioAlignments.jl/blob/master/src/bam/record.jl#L548

I've tried the following and it seems to work fine:

function quality(record)
    BAM.checkfilled(record)
    seqlen = BAM.seqlength(record)
    offset = BAM.seqname_length(record) + BAM.n_cigar_op(record, false) * 4 + BAM.cld(seqlen, 2)
    
    return record.data[(1+offset):(seqlen+offset)]
end

Cut down the time from 232 to 65 ns. Returning a view cuts it by further by 2. Maybe there's some cases where that fails ? I saw the bam specifications says qualities are char and not uint8.

error when using SAM.auxdata(record)

It seems that the optional fields aren't correctly parsed which in turn leads to an error raised by SAM.auxdata(record).

I find that some keys returned by keys(record) are incorrect.

For example, AS:i:0 returns :i as the key.

Make `seqname` consistent with BioJulia ecosystem

Make seqname an alias of refname to make seqname consistent with the rest of the BioJulia ecosystem - namely GenomicFeatures.

Also, make BioGenerics.groupname an alias of refname. The groupname method will be used in GenomicFeatures v3.

Broken with Automa 0.8.1

Problem

XAM 0.2.6 doesn't work with Automa 0.8.1 (and because it's not pinned to 0.8.0, that version is installed by default)
Precompilation fails. Trace:

julia> using XAM
[ Info: Precompiling XAM [d759349c-bcba-11e9-07c2-5b90f8f05f7c]
ERROR: LoadError: LoadError: LoadError: Ambiguous DFA: Input 0x21 can lead to actions [:header, :mark, :pos] or [:header]
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:33
  [2] validate_paths(paths::Vector{Tuple{Union{Nothing, Automa.Edge}, Vector{Symbol}}})
    @ Automa ~/.julia/packages/Automa/vXuxy/src/dfa.jl:179
  [3] validate_nfanodes
    @ ~/.julia/packages/Automa/vXuxy/src/dfa.jl:198 [inlined]
  [4] nfa2dfa(nfa::Automa.NFA, unambiguous::Bool)
    @ Automa ~/.julia/packages/Automa/vXuxy/src/dfa.jl:103
  [5] compile(re::Automa.RegExp.RE; optimize::Bool, unambiguous::Bool)
    @ Automa ~/.julia/packages/Automa/vXuxy/src/machine.jl:55
  [6] compile
    @ ~/.julia/packages/Automa/vXuxy/src/machine.jl:55 [inlined]
  [7] map
    @ ./tuple.jl:215 [inlined]
  [8] map (repeats 2 times)
    @ ./tuple.jl:216 [inlined]
  [9] (::XAM.SAM.var"#28#29")()
    @ XAM.SAM ~/.julia/packages/XAM/9nk3g/src/sam/readrecord.jl:146
 [10] top-level scope
    @ ~/.julia/packages/XAM/9nk3g/src/sam/readrecord.jl:11
 [11] include(mod::Module, _path::String)
    @ Base ./Base.jl:386
 [12] include(x::String)
    @ XAM.SAM ~/.julia/packages/XAM/9nk3g/src/sam/sam.jl:4
 [13] top-level scope
    @ ~/.julia/packages/XAM/9nk3g/src/sam/sam.jl:69
 [14] include(mod::Module, _path::String)
    @ Base ./Base.jl:386
 [15] include(x::String)
    @ XAM ~/.julia/packages/XAM/9nk3g/src/XAM.jl:1
 [16] top-level scope
    @ ~/.julia/packages/XAM/9nk3g/src/XAM.jl:7
 [17] include
    @ ./Base.jl:386 [inlined]
 [18] include_package_for_output(pkg::Base.PkgId, input::String, depot_path::Vector{String}, dl_load_path::Vector{String}, load_path::Vector{String}, concrete_deps::Vector{Pair{Base.PkgId, UInt64}}, source::Nothing)
    @ Base ./loading.jl:1213
 [19] top-level scope
    @ none:1
 [20] eval
    @ ./boot.jl:360 [inlined]
 [21] eval(x::Expr)
    @ Base.MainInclude ./client.jl:446
 [22] top-level scope
    @ none:1
in expression starting at /home/dialvarezs/.julia/packages/XAM/9nk3g/src/sam/readrecord.jl:11
in expression starting at /home/dialvarezs/.julia/packages/XAM/9nk3g/src/sam/sam.jl:4
in expression starting at /home/dialvarezs/.julia/packages/XAM/9nk3g/src/XAM.jl:1
ERROR: Failed to precompile XAM [d759349c-bcba-11e9-07c2-5b90f8f05f7c] to /home/dialvarezs/.julia/compiled/v1.6/XAM/jl_nOmVNU.
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] compilecache(pkg::Base.PkgId, path::String, internal_stderr::Base.TTY, internal_stdout::Base.TTY)
   @ Base ./loading.jl:1360
 [3] compilecache(pkg::Base.PkgId, path::String)
   @ Base ./loading.jl:1306
 [4] _require(pkg::Base.PkgId)
   @ Base ./loading.jl:1021
 [5] require(uuidkey::Base.PkgId)
   @ Base ./loading.jl:914
 [6] require(into::Module, mod::Symbol)
   @ Base ./loading.jl:901

Environment

Package Version used: 0.2.6
Julia Version used: 1.6.0
Operating System and version: Fedora 33

SAM.Reader file format error

I'm trying to use the bioalignments sam reader to read the alignments in a file created by Minimap2. This file looks like a 'normal' sam file, but I still get the following error:

BioAlignments.SAM.Reader file format error on line 1536 ~>"\t*\n"

Line 1536 is the first line in the file that contains an alignment.

I am using the following code and the error occurs on line 4:

    reader = SAM.Reader(open(samfile,"r"))
    record = SAM.Record()
    while !eof(reader)
            read!(reader, record)
    end

Is there any way to check whether chromosome is available for a BAM file when using eachoverlap()?

I'd like to use eachoverlap() as the example in the documentation.
However, when the region is not available in a BAM file (e.g., special chromosome names like patches in human reference genome), I got an error (see below for details):

ERROR: MethodError: no method matching overlapchunks(::Indexes.BGZFIndex, ::Nothing, ::UnitRange{Int64})
Closest candidates are:
  overlapchunks(::Indexes.BGZFIndex, ::Integer, ::UnitRange) at /Users/ozakiharuka/.julia/packages/Indexes/fj340/src/bgzfindex.jl:41

Expected Behavior

Current Behavior

Possible Solution / Implementation

I think one way to handle such error is to check whether a BAM file contains the region before eachoverlap() is called.
Does anybody know how to do this?

Steps to Reproduce (for bugs)

Test data (.bam, .bam.bai) is here

julia> using XAM

julia> using GenomicFeatures

julia> function print_overlapped_record(bam_file, chrom, leftpos, rightpos)
         open(BAM.Reader, bam_file, index = bam_file * ".bai") do reader
           for record in eachoverlap(reader, chrom, leftpos:rightpos)
             print(record)
           end
         end
       end
print_overlapped_record (generic function with 1 method)

julia> testfun(bam_file, "chr1", 3205900, 3616344)
XAM.BAM.Record:
      template name: NS500723:19:H5LLWBGXX:2:13108:16341:13952
               flag: 16
       reference ID: 1
           position: 3279555
    mapping quality: 255
              CIGAR: 75M
  next reference ID: 0
      next position: 0
    template length: 0
           sequence: ACCAAAGGTCCACCTTGCCTTCACCTTATTTGTCCTAAATTTCTTGCACACCGATATCAAAGGCCAGTCTGCAGC
       base quality: UInt8[0x24, 0x24, 0x0e, 0x20, 0x24, 0x20, 0x24, 0x0e, 0x0e, 0x0e, 0x20, 0x24, 0x0e, 0x0e, 0x0e, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x0e, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x0e, 0x24, 0x0e, 0x0e, 0x24, 0x24, 0x20, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x20, 0x24, 0x24, 0x24, 0x24, 0x0e, 0x24, 0x24, 0x20, 0x15, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x20, 0x24, 0x24, 0x15, 0x24, 0x24, 0x20, 0x24, 0x24, 0x20, 0x20, 0x20, 0x20, 0x15]
     auxiliary data: AS=-3 XN=0 XM=1 XO=0 XG=0 NM=1 MD=8C66 YT=UU NH=1XAM.BAM.Record:
      template name: NS500723:19:H5LLWBGXX:2:12211:9394:16712
               flag: 16
       reference ID: 1
           position: 3532296
    mapping quality: 255
              CIGAR: 52M
  next reference ID: 0
      next position: 0
    template length: 0
           sequence: TTTATCAAGTGTGGGTCGGAGCCAAGCAGCACGAACTGCAGGTGAACTCCAT
       base quality: UInt8[0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x20, 0x20, 0x20, 0x20, 0x20]
     auxiliary data: AS=-5 XN=0 XM=1 XO=0 XG=0 NM=1 MD=4A47 YT=UU NH=1XAM.BAM.Record:
      template name: NS500723:19:H5LLWBGXX:3:13405:17228:11793
               flag: 0
       reference ID: 1
           position: 3532529
    mapping quality: 255
              CIGAR: 75M
  next reference ID: 0
      next position: 0
    template length: 0
           sequence: GGGAACACGTCACCAGTCACCTACAACATGTTCGGACACTTCAAGTTCTGCATCACTCTCTGCGGAGGATACATT
       base quality: UInt8[0x20, 0x20, 0x20, 0x20, 0x20, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24]
     auxiliary data: AS=-5 XN=0 XM=1 XO=0 XG=0 NM=1 MD=7T67 YT=UU NH=1XAM.BAM.Record:
      template name: NS500723:19:H5LLWBGXX:4:11504:6970:1796
               flag: 16
       reference ID: 1
           position: 3532535
    mapping quality: 255
              CIGAR: 75M
  next reference ID: 0
      next position: 0
    template length: 0
           sequence: ACGTCACCAGTCACCTACAACATGTTCGGACACTTCAAGTTCTGCATCACTCTCTGCGGAGGATACATTCTATTT
       base quality: UInt8[0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x20, 0x20, 0x20, 0x20, 0x20]
     auxiliary data: AS=-5 ZS=-10 XN=0 XM=1 XO=0 XG=0 NM=1 MD=1T73 YT=UU NH=1XAM.BAM.Record:
      template name: NS500723:19:H5LLWBGXX:1:13307:15032:13241
               flag: 16
       reference ID: 1
           position: 3532548
    mapping quality: 255
              CIGAR: 39M
  next reference ID: 0
      next position: 0
    template length: 0
           sequence: CCTACAACATGTTCGGACACTTCAAGTTCTGCATCACTC
       base quality: UInt8[0x0e, 0x24, 0x20, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x0e, 0x24, 0x24, 0x0e, 0x15, 0x24, 0x15, 0x24, 0x24, 0x20, 0x20, 0x20, 0x0e, 0x24, 0x20, 0x24, 0x24, 0x20, 0x24, 0x24, 0x24, 0x20, 0x20, 0x20, 0x20, 0x20]
     auxiliary data: AS=0 ZS=-5 XN=0 XM=0 XO=0 XG=0 NM=0 MD=39 YT=UU NH=1
julia> testfun(bam_file, "chrZ", 3205900, 3516344)
ERROR: MethodError: no method matching overlapchunks(::Indexes.BGZFIndex, ::Nothing, ::UnitRange{Int64})
Closest candidates are:
  overlapchunks(::Indexes.BGZFIndex, ::Integer, ::UnitRange) at /Users/ozakiharuka/.julia/packages/Indexes/fj340/src/bgzfindex.jl:41
Stacktrace:
 [1] iterate(::XAM.BAM.OverlapIterator{IOStream}) at /Users/ozakiharuka/.julia/packages/XAM/ahh4D/src/bam/overlap.jl:54
 [2] (::var"#9#10"{String,Int64,Int64})(::XAM.BAM.Reader{IOStream}) at ./REPL[25]:4
 [3] #open#2(::Base.Iterators.Pairs{Symbol,String,Tuple{Symbol},NamedTuple{(:index,),Tuple{String}}}, ::typeof(open), ::var"#9#10"{String,Int64,Int64}, ::Type{XAM.BAM.Reader}, ::String) at /Users/ozakiharuka/.julia/packages/BioGenerics/cCuGr/src/IO.jl:48
 [4] #open at ./none:0 [inlined]
 [5] testfun(::String, ::String, ::Int64, ::Int64) at ./REPL[25]:3
 [6] top-level scope at REPL[41]:1

Context

Your Environment

Package Version used: v"0.2.3"
Julia Version used: Version 1.3.1
Operating System and version (desktop or mobile): macOS mojave (10.14.6)
Link to your project:

Has flag() renamed to flags()?

When I ran a previous code I faced an error, since BAM.flag() function no longer existed. I found these was other undocumented function XAM.flags(). It seems XAM.flags() behaves identically to the previous BAM.flag() function after a try. Is that XAM.flags() is just a rename of BAM.flag() in the current and future versions, or is there another "formal" way I should use to access the flags data of read mapping information? Thank you!

eachoverlap running past end of stream

When looping trough a BAM file with eachoverlap I get an error if the selected interval intersects past the last read of the BAM file.

Expected Behavior

samtools view -b chr1:1-10000 bigBAM.bam > small.bam
samtools index small.bam

reader = open(BAM.Reader, "small.bam", index="small.bam.bai")
for record in eachoverlap(reader, "chr1", 1:1000000)
    # ...
    # ...
end
close(reader)

Loops correctly trough all the reads of the small bam file.

Current Behavior

The loop manages to read the last BAM record but then fails with a:

Stacktrace:
 [1] getindex at ./array.jl:728 [inlined]
 [2] virtualoffset(::BGZFStreams.BGZFStream{IOStream}) at /nfs/users/nfs_r/ra11/.julia/packages/BGZFStreams/qApsr/src/bgzfstream.jl:156
 [3] iterate(::BioAlignments.BAM.OverlapIterator{IOStream}, ::BioAlignments.BAM.OverlapIteratorState) at /nfs/users/nfs_r/ra11/.julia/packages/BioAlignments/FOisL/src/bam/overlap.jl:65
 [4] top-level scope at ./REPL[4]:2

Possible Solution / Implementation

Steps to Reproduce (for bugs)

Same as stated above. Take a section of a BAM and run

Context

Your Environment

Package Version used: 1.0.1 & 2.0.0
Julia Version used: 1.2.0
Operating System and version (desktop or mobile): Linux
Link to your project:

while loop got eof error

Hi, I run XAM follow your code with while loop but I got an eof error:

And I change to for loop it works fine:

How should I solve it,thanks!

Type instability when reading a BAM file

I don't know if this is a significant bottleneck, but I noticed some type instability when looping over BAM file records:

function read_bam(bam_file)
# iterate over BAM records
chr = String[]
left_pos = Int64[]
right_pos = Int64[]
for record in reader
push!(chr, BAM.refname(record))
push!(left_pos, leftposition(record))
push!(right_pos, rightposition(record))
end
DataFrame(chr = chr, left = left_pos, right = right_pos)
end

@code_warntype read_bam(some_file)

!julia> @code_warntype read_bam(bam_file)
Variables:
#self#
bam_file
record::ANY
#temp#::ANY
chr::Array{String,1}
left_pos::Array{Int64,1}
right_pos::Array{Int64,1}
itemT

Body:
begin
chr::Array{String,1} = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Array{String,1}, svec(Any, Int64), Array{String,1}, 0, 0, 0)) # line 4:
left_pos::Array{Int64,1} = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Array{Int64,1}, svec(Any, Int64), Array{Int64,1}, 0, 0, 0)) # line 5:
right_pos::Array{Int64,1} = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Array{Int64,1}, svec(Any, Int64), Array{Int64,1}, 0, 0, 0)) # line 6:
SSAValue(0) = Main.reader
#temp#::ANY = (Base.start)(SSAValue(0))::ANY
9:
unless !((Base.done)(SSAValue(0), #temp#::ANY)::ANY)::ANY goto 39
SSAValue(1) = (Base.next)(SSAValue(0), #temp#::ANY)::ANY
record::ANY = (Core.getfield)(SSAValue(1), 1)::ANY
#temp#::ANY = (Core.getfield)(SSAValue(1), 2)::ANY # line 7:
SSAValue(5) = ($(QuoteNode(BioAlignments.BAM.refname)))(record::ANY)::String
$(Expr(:inbounds, false))
# meta: location array.jl push! 652
SSAValue(6) = (Base.bitcast)(UInt64, (Base.check_top_bit)(1)::Int64)
$(Expr(:foreigncall, :(:jl_array_grow_end), Void, svec(Any, UInt64), :(chr), 0, SSAValue(6), 0)) # line 653:
# meta: location abstractarray.jl endof 134
# meta: location abstractarray.jl linearindices 99
# meta: location abstractarray.jl indices1 71
# meta: location abstractarray.jl indices 64
SSAValue(9) = (Base.arraysize)(chr::Array{String,1}, 1)::Int64
# meta: pop location
# meta: pop location
# meta: pop location
# meta: pop location
(Base.arrayset)(chr::Array{String,1}, SSAValue(5), (Base.select_value)((Base.slt_int)(SSAValue(9), 0)::Bool, 0, SSAValue(9))::Int64)::Array{String,1}
# meta: pop location
$(Expr(:inbounds, :pop)) # line 8:
(Main.push!)(left_pos::Array{Int64,1}, (Main.leftposition)(record::ANY)::ANY)::Array{Int64,1} # line 9:
(Main.push!)(right_pos::Array{Int64,1}, (Main.rightposition)(record::ANY)::ANY)::Array{Int64,1}
37:
goto 9
39: # line 11:
return $(Expr(:invoke, MethodInstance for (::Core.#kw#Type)(::Array{Any,1}, ::Type{DataFrames.DataFrame}), :($(QuoteNode(Type))), :($(Expr(:invoke, MethodInstance for vector_any(::Any, ::Vararg{Any,N} where N), :(Base.vector_any)$
end::DataFrames.DataFrame

I think the ANY type for record and temp indicate some type instability

Finding a map between observed bases and positions in the reference genome

Hi,

Thank you for building and maintaining this great package --- I wouldn't start analyzing genomic data via Julia without it!

I have a steps that I used to perform using pysam, but could not figure how they can be done via XAM:

For each record/read, get a list of start and end positions of aligned gapless blocks.
For each record/read, get a list of reference positions that this read aligns to. The length of the list should be the same length as the read. This should allow mapping each observed base to a position in the reference genome, in the presence of insertions and deletions.

Would you be able to guide me how this can be done?

Sincerely,
Daniel

Feature Request: Conversion from `BioAlignments` to `SAM.Record`

Previously seen on Slack

It would be great if there was some way to convert an alignment into a SAM/BAM.Record so that it could be exported to the filesystem for future use. Note that I will only refer to SAM.Record from here on, but the same would be expected for BAM.Record, along the lines of #25.

Expected Behavior

Ideally, this would simply be a new method for the SAM.Record constructor. Something like

XAM.SAM.Record(aln::PairwiseAlignment, kwargs...)

There would need to be other information (possibly with defaults?) added, as a SAM.Record contains more fields than

Current Behavior

There is no such behavior.

Possible Solution / Implementation

The challenge here is that SAM.Records store more information than PairwiseAlignments. Here's a map of what I think could be inferred vs. what needs to be passed as an argument.

Column	Field name	`PairwiseAlignment` accessor	Possible default
1	QNAME	N/A	N/A
2	FLAG	N/A*	N/A
3	RNAME	N/A	N/A
4	POS	N/A	1**
5	MAPQ	unknown	255
6	CIGAR	cigar(aln.a.aln)	N/A
7	RNEXT	N/A	N/A***
8	PNEXT	N/A	N/A***
9	TLEN	length(aln.a.seq)	N/A
10	SEQ	aln.a.seq	N/A
11	QUAL	N/A	"*"
12	NM****	count_mismatches(aln)	N/A

Notes

*

It might be possible to compute some of the flags based on the sequence, but I don't know how to do so.

**

In my experience, BioAlignments.cigar generates an alignment that always starts at position 1, and then pads it with deletions.

***

These fields are context-sensitive. Maybe generating records should be stateful 🤷?

****

Technically, the Number of Mismatches field is optional, but it is included in the recommended practices, and some tools fail if it isn't included.

Context

I was working on generating very specific alignments for variant calling purposes, and found that I couldn't export the result to another tool. Having this functionality would remedy that.

Migrate from BGZFStreams to CodecBGZF

Dear @CiaranOMara

I haven't lost track of the existing issues, and will get around to resolving them!

However, I got sidetracked with a more fundamental issue: The BGZFStreams package is slow, and not well integrated with the rest of the Julia IO ecosystem. Kenta Sato, the author of BGZFStreams.jl realized this, and created CodecBGZF.jl three years ago, but it looks like that project has been abandoned, and Kenta is a busy guy, hard to reach these days. See also this comment from Kenta Sato

In fact, BGZFStreams is the major bottleneck of XAM.jl, to the extent that optimizing XAM does not make any practical difference because it is so bogged down by BGZFStreams. For that reason, I created CodecBZFG myself. It's faster, safer, and more generic than BGZFStreams. I'd expect XAM's BAM module to be around 2x faster with CodecBGZF, perhaps more.

It'll be a bit of a project, since XAM depends on Indexes, which also depends on BGZFStreams.

I'll make this issue here, and then hopefully get around to it the following few months. If you support this change, and feel like giving it a crack, please be my guest! It'll also be useful to have other eyes on CodecBGZF to make sure I didn't bork the interface.

Export and document some more things?

First, thanks for this package! I made some functions for creating objects from my GenomicVectors.jl package using your BAM reader functions. I think I found the right functions.

Chromosome names and lengths:
function GenomeInfo(name, reader::BioAlignments.BAM.Reader)
GenomeInfo(name,
reader.refseqnames,
reader.refseqlens
)
end

Chromosome name and alignment start and end positions:
read_bam = function(bam_file)
# iterate over BAM records
chr = String[]
left_pos = Int64[]
right_pos = Int64[]
for record in reader
push!(chr, BAM.refname(record))
push!(left_pos, leftposition(record))
push!(right_pos, rightposition(record))
end
DataFrame(chr = chr, left = left_pos, right = right_pos)
end

Should/could reader.refseqnames, reader.refseqlens and BAM.refname be exported? I'd be happy to add the exports, some docstrings and a bit of README text if you are open to it.

can you add some functions to BAM.Reader?

can you add some new functions to the whole BAM class? for examples,

get mapped and unmaped read numbers for every reference
get all reference names
...
check if one reference name was in bam files

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Do you have the plan to add the pileup functions?

Sometimes,We want to iter bam according to the position not record. So do you have the plan to add the function.

Add new interfaces for manipulating auxilary BAM fields

I've been working a bit with BAM files using BioAlignments the past days and have been looking through the source code. During the work, there was a few function I found that I missed from BioAlignments which would have my work easier. They would all be quite easy to implement, and I'd love to pitch a PR with the new interfaces + tests, but being interfaces, they're hard to change once released and very annoying to get wrong. So, let's discuss them first!

Phase out AuxData, or make it view-based.

Currently, an AuxData object can be created from the auxiliary data of a BAM record. In the current implementation It necessitates a copy of the data from the record, meaning that the user can manipulate the AuxData and seemingly succeeding, but without the record actually changing. Furthermore, most interfaces (all except iterate, AFAIK) for aux fields is currently implemented to work on Vector{UInt8} or on the records directly, making the AuxData type mostly obsolete.
I suggest removing the object alltogether to avoid code duplication of functions acting on AuxData and Records and confusion about what the user is mutating. If this is not acceptable, then making the AuxData view-based.

`get(record::BAM.Record, key::AbstractString, default)`

Records work sort of like AbstractDict. In fact, the implemented AuxData type is a subtype of AbstractDict. But asking for a value with a default is very awkward for the user. I suggest adding this method, which is to completely mirror how it works for Dicts.

`delete!(record::BAM.Record, key::AbstractString)`

Actually changing the auxiliary fields of BAM.Record objects is difficult. Again, this one should simply mirror the similar method for Dicts. One could also think about adding pop!.

`setindex!(record::BAM.Record, key::AbstractString, value)`

This would also be useful to have. The value would be encoded and appended to the end of the data array. Probably a few different methods would have to be written, since the type of Value should only be an UInt8, Int32, Float32, String, or Vector{T} where T can be UInt8, Int8, UInt16, Int16, UInt32, Int32 or Float32.

Some new interface for making headers more easy

People can make all kinds of custom headers, but usually they just want the @SQ headers updated plus whatever non-SQ headers the program created. It's worth noting that BioAlignments make no checks for the header and records matching together, gladly writing a healthy-looking but unparsable (by htslib) BAM.file, so making it easy to not mess up the headers is rather important. I suggest making it easy to update the SQ headers. I'd like any suggestion here.
I can imagine two new constructors for SAM.Header, namely SAM.Header(h::Sam.Header, records::Vector{BAM.Record}) and SAM.Header(records::Vector{BAM.Record}), the latter creating a header with the appropriate "SQ" headers, the former with the same, but keeping any non-SQ tags already present.

XAM compatibilty for Automa very old, please update?

Hi,

Could someone update Automa dependency ? XAM is restricted to 0.7, 0.8 meanwhile latest is 1.2.1. It's complicated to explain, but some packages that fix some bugs in latest version require latest Automa and this is conflicting with XAM requirements which are a bit out dated.

[WIP] convert SAM to BAM

I'm trying to convert SAM record to a BAM one. I think I got the data layout roughly correct but the cigar and sequences need to be encoded differently or something. For context I now have paired read alignement working in BurrowsWheelerAligner.jl, so it would be possible to do a FASTQ to BAM & results directly in Julia. In general it would be nice to have more support for creating BAM alignment from scratch, modify them, etc. (https://discourse.julialang.org/t/change-record-sequence-in-a-bam-file-using-xam/98141/2) and I think converting SAM to BAM is a good starting point.

Any help would be appreciated.

sam = "seq1\t81\tPhiX\t1051\t60\t70M\t=\t1821\t702\tTCTTGGCTTCCTTGCTGGTCAGATTGGTCGTCTTATTACCATTTCAACTACTCCGGTTATCGCTGGCGAC\t*\tNM:i:0\tMD:Z:70\tMC:Z:70M\tAS:i:70\tXS:i:0"

r = XAM.SAM.Record(sam)

cigar_ops = ('M','I','D','N','S','H','P','=','X')

br = BAM.Record()

# fixed-length fields (see BMA specs for the details)
br.refid = 1 #TODO : get from header
br.pos = SAM.position(r)
br.l_read_name = SAM.tempname(r) |> x -> length(x)+1 #why do I need + 1 ?
br.mapq = SAM.mappingquality(r)
#br.bin
br.n_cigar_op = sum(x ∈ cigar_ops for x in SAM.cigar(r))
br.flag = SAM.flag(r)
br.l_seq = SAM.seqlength(r)
br.next_refid = 1 #TODO : get from header
br.next_pos = SAM.nextposition(r)
br.tlen = SAM.templength(r)

# variable length data

t = x -> unsafe_wrap(Vector{UInt8}, x)

br.data = vcat(
    SAM.tempname(r) |> t,
    SAM.cigar(r) |> t,
    SAM.sequence(String, r) |> t,
    fill(0x41, length(SAM.sequence(r))),
    #SAM.quality(r),
    reduce(vcat, r.data[f] for f in r.fields) # aux data
)
br.block_size = length(br.data)

br

XAM.BAM.Record:
      template name: seq1
               flag: 81
       reference ID: 2
           position: 1052
    mapping quality: 60
              CIGAR: 70599891M
  next reference ID: 2
      next position: 1822
    template length: 702
           sequence: RGRGGVGVGMRGRGGMGMRGRGGVGMRGGVGVRGGMGAGVGARGRGGVGVRGGMGVRGGMRGRGGARGRG
       base quality: UInt8[0x41, 0x43, 0x43, 0x41, 0x54, 0x54, 0x54, 0x43, 0x41, 0x41  …  0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41]
     auxiliary data: AA=A AA=A AA=A AA=A AA=A AA=A AA=A AA=AERROR: invalid type tag: 'M'
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:35
  [2] auxtype
    @ ~/.julia/packages/XAM/l6LT2/src/bam/auxdata.jl:63 [inlined]
  [3] loadauxtype
    @ ~/.julia/packages/XAM/l6LT2/src/bam/auxdata.jl:82 [inlined]

Position of unmapped mate

Hi @CiaranOMara, I encountered this issue as well and I put together a BAM file that has two reads with the same position (one is the other's unmapped mate) for reproducing the issue. I find that the eof check fix (e.g. with feature/CodecBGZF-issue-31 branch) will actually skip one of the reads. See https://gist.github.com/Marlin-Na/9cb7767dfc87b007500a6fff4a8f5b9a

Originally posted by @Marlin-Na in #31 (comment)

I think my issue was that if a read is unmapped, its rightposition(record) will become position(record) - 1, which is >smaller than the leftposition. I think the check here https://github.com/BioJulia/XAM.jl/blob/develop/src/bam/overlap.jl#L87 >should account for this special case. Probably change it to the following? It worked for me.
   if rid < interval[1] || (rid == interval[1] && rightposition(record) < first(interval[2]) &&  position(record) < first(interval[2]))
and
   if rid > interval[1] || (rid == interval[1] && position(record) > last(interval[2]) && rightposition(record) > last(interval[2]))

Originally posted by @Marlin-Na in #31 (comment)

biojulia / xam.jl Goto Github PK

xam.jl's Issues

Expected Behavior

Current Behavior

Possible Solution / Implementation

Steps to Reproduce (for bugs)

Context

Your Environment

List of errors/inconsistencies

Proposals/enhancements

Inconsistent BAM flag accessor functions

Expected Behavior

Current Behavior

Your Environment

Expected Behavior

Current Behavior

Possible Solution / Implementation

Steps to Reproduce (for bugs)

Your Environment

Problem

Environment

Expected Behavior

Current Behavior

Possible Solution / Implementation

Steps to Reproduce (for bugs)

Context

Your Environment

Expected Behavior

Current Behavior

Possible Solution / Implementation

Steps to Reproduce (for bugs)

Context

Your Environment

Expected Behavior

Current Behavior

Possible Solution / Implementation

Notes

*

**

***

****

Context

Add new interfaces for manipulating auxilary BAM fields

Phase out AuxData, or make it view-based.

get(record::BAM.Record, key::AbstractString, default)

delete!(record::BAM.Record, key::AbstractString)

setindex!(record::BAM.Record, key::AbstractString, value)

Some new interface for making headers more easy

Recommend Projects

Recommend Topics

Recommend Org

`get(record::BAM.Record, key::AbstractString, default)`

`delete!(record::BAM.Record, key::AbstractString)`

`setindex!(record::BAM.Record, key::AbstractString, value)`