biojulia / xam.jl Goto Github PK
View Code? Open in Web Editor NEWParse and process SAM and BAM formatted files
License: MIT License
Parse and process SAM and BAM formatted files
License: MIT License
I have used XAM.jl to treat BAM files. Today I came across an error when opening a BAM file generated by HISAT2 version 2.2.0. No error occurred for a BAM file generated by HISAT2 version 2.1.0 (from the same FASTQ file)
No errors are expected.
julia> using XAM
julia> reader = open(BAM.Reader, "SRR8452726_1.sort.bam")
julia> using XAM
julia> reader = open(BAM.Reader, "SRR8452726_1.sort.bam")
ERROR: ArgumentError: malformed metainfo at line 68
Stacktrace:
[1] readheader!(::TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,Base.GenericIOBuffer{Array{UInt8,1}}}, ::XAM.SAM.Header, ::Tuple{Int64,Int64}) at /Users/ozakiharuka/.julia/packages/XAM/ahh4D/src/sam/readrecord.jl:318
[2] Reader at /Users/ozakiharuka/.julia/packages/XAM/ahh4D/src/sam/reader.jl:13 [inlined]
[3] Reader at /Users/ozakiharuka/.julia/packages/XAM/ahh4D/src/sam/reader.jl:38 [inlined]
[4] init_bam_reader(::BGZFStreams.BGZFStream{IOStream}) at /Users/ozakiharuka/.julia/packages/XAM/ahh4D/src/bam/reader.jl:96
[5] init_bam_reader at /Users/ozakiharuka/.julia/packages/XAM/ahh4D/src/bam/reader.jl:125 [inlined]
[6] #Reader#2 at /Users/ozakiharuka/.julia/packages/XAM/ahh4D/src/bam/reader.jl:36 [inlined]
[7] Reader at /Users/ozakiharuka/.julia/packages/XAM/ahh4D/src/bam/reader.jl:31 [inlined]
[8] #open#1 at /Users/ozakiharuka/.julia/packages/BioGenerics/cCuGr/src/IO.jl:42 [inlined]
[9] open(::Type{XAM.BAM.Reader}, ::String) at /Users/ozakiharuka/.julia/packages/BioGenerics/cCuGr/src/IO.jl:42
[10] top-level scope at REPL[2]:1
(v1.3) pkg> status
....
[d759349c] XAM v0.2.3
....
julia> versioninfo()
Julia Version 1.3.1
Commit 2d5741174c (2019-12-30 21:36 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin18.6.0)
CPU: Intel(R) Core(TM) i5-9600K CPU @ 3.70GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
I have no idea but the release note of HISAT2 might help.
HISAT 2.2.0 release 2/6/2020
This major version update includes a new feature to handle “repeat” reads. Based on sets of 100-bp simulated and 101-bp real reads that we tested, we found that 2.6-3.4% and 1.4-1.8% of the reads were mapped to >5 locations and >100 locations, respectively. Attempting to report all alignments would likely consume a prohibitive amount of disk space. In order to address this issue, our repeat indexing and alignment approach directly aligns reads to repeat sequences, resulting in one repeat alignment per read. HISAT2 provides application programming interfaces (API) for C++, Python, and JAVA that rapidly retrieve genomic locations from repeat alignments for use in downstream analyses.
Other minor bug fixes are also included as follows:Fixed occasional sign (+ or -) issues of template lengths in SAM file
Fixed duplicate read alignments in SAM file
Skip a splice site if exon’s last base or first base is ambiguous (N)
julia> using XAM
julia> reader = open(BAM.Reader, "SRR8452726_1.sort.bam")
(v1.3) pkg> status
Status `~/.julia/environments/v1.3/Project.toml`
[c7e460c6] ArgParse v1.1.0
[8e4a8c10] BED v0.1.0
[00701ae9] BioAlignments v2.0.0
[7e6ae17a] BioSequences v2.0.5
[336ed68f] CSV v0.6.1
[9961bab8] Cbc v0.6.7
[aaaa29a8] Clustering v0.14.0
[8f4d0f93] Conda v1.4.1
[a93c6f00] DataFrames v0.20.2
[968ba79b] DocOpt v0.4.0
[8f5d6c58] EzXML v1.1.0
[c2308a5c] FASTX v1.1.2
[652a1917] Fire v0.1.0
[587475ba] Flux v0.11.0
[899a7d2d] GenomicFeatures v2.0.0
[c27321d9] Glob v1.3.0
[a2cc645c] GraphPlot v0.4.2
[cd3eb016] HTTP v0.8.16
[7073ff75] IJulia v1.21.2
[682c06a0] JSON v0.21.0
[4076af6c] JuMP v0.21.2
[093fc24a] LightGraphs v1.3.1
[9b87118b] PackageCompiler v1.1.1
[91a5bcdd] Plots v1.0.10
[d330b81b] PyPlot v2.9.0
[777a009e] ReadCoverage v0.1.1 #master (https://github.com/bioinfo-tsukuba/ReadCoverage.jl)
[3cdcf5f2] RecipesBase v1.0.0
[01d81517] RecipesPipeline v0.1.3
[2913bbd2] StatsBase v0.33.0
[70df011a] TableReader v0.4.0
[9c690861] TensorToolbox v1.0.1
[d759349c] XAM v0.2.3
[ddb6d928] YAML v0.4.0
[8bb1440f] DelimitedFiles
[de0858da] Printf
Dear @CiaranOMara
I've been working a bit with XAM.jl. Great package, but I've found quite a number of small errors and mistakes I think should be fixed. Making one PR for each would be overwhelming, especially since fixing some of them has implications that are not straightforward. So I'd like to discuss it a bit here:
I've implemented many of these changes at https://github.com/jakobnissen/XAM.jl, but that "fork" is stale, and I'd rather merge the fixes into the original XAM.jl
I've observed a memory leak with the in-place reading/iteration pattern.
using XAM
using ProgressMeter
io = open(file_sam, "r")
reader = XAM.SAM.Reader(io)
p = Progress(filesize(file_sam))
record = XAM.SAM.Record()
while !eof(reader)
read!(reader, record)
#...
ProgressMeter.update!(p, position(reader.state.stream),showvalues = [(:position,position(reader.state.stream))])
end
close(io)
Currently, it's possible to create a spec-compliant BAM record with the following behaviour:
julia> BAM.hasrefname(record)
true
julia> BAM.refname(record)
ERROR: ArgumentError: record is not mapped
Stacktrace:
[1] checked_refid at /home/jakni/.julia/packages/BioAlignments/AGQde/src/bam/record.jl:162 [inlined]
[2] refname(::BioAlignments.BAM.Record) at /home/jakni/.julia/packages/BioAlignments/AGQde/src/bam/record.jl:178
[3] top-level scope at none:0
(on a related note, the function BAM.hasnextrefname
simply doesn't work.)
What's going on?
Well, for a subset (which subset, I'm not quite sure) of the flags, there is a corresponding flag accessor isX
function and a function hasX
. I'm not entirely sure what the purpose of the hasX
function are - some of them seem to be exported to BioCore. It might be I'm simply confused as to what the purpose of these functions are, if so, please tell me.
Anyway, the behaviour of these functions are inconsistent and highly unintuitive. For example, ismapped
, isprimary
and ispositivestrand
all raise an error if the record is unfilled, else return a boolean. However isnextmapped
returns false
with an unfilled record.
Furthermore, hasrefid
, hasrefname
, hasnextrefid
returns whether the record is filled, NOT whether the record is aligned (i.e. actually has a refid etc.). So a valid, unmapped record can have an ID of -1 and still return true
when asked if hasrefid
. The similar function hasnextrefname
, however, actually does check whether it has a valid next reference name.
Also, refname
and nextrefname
checks for the validity of the corresponding IDs before fetching the name, whereas hasrefname
does not (hence the error at the beginning of this issue).
hasposition
returns true
if the Record is unmapped (i.e. has an invalid position), but hasrightposition
returns false
in that case.
hasflag
seem to me to be meaningless. All BAM records by definition has flags. Currently, it simply returns isfilled(record)
, and is not used in any other code.
This seems very unsystematic. I propose changing all these functions to follow the following rules:
hasX
returns true
, then isX
will always return a boolean which is guaranteed to be correct.hasX
returns false
, then isX
will always throw an errorhasrefid
, hasrefname
, hasnextrefid
and hasnextrefname
will return false
if the record is unfilled, or if the record/mate is unmapped, else true
hasposition
, hasnextposition
and hasrightposition
returns false
is the record is unfilled or the record/mate is unmapped, else true
Hi,
Would it be possible to add eachoverlap
such that it can be browsed by chromosome when the BAM is indexed ?
chrom = "chr1"
itr = eachoverlap(reader, chrom)
chrom = "chr1"
idx = findfirst(isequal(chrom), reader.refseqnames)
refseqlen = reader.refseqlens[idx]
itr = eachoverlap(reader, chrom, 1:refseqlen))
Thanks,
I would contend that Oneflow is a better branching model than Git Flow for Julia packages. BioAlignments made the transition (albeit, a rocky one). What do others think?
See BioJulia/BioGenerics.jl#14 (comment).
See BioJulia/FASTX.jl@c66d035 for an implementation example.
how to use @threads to read BAM file more than one core?
How do you browse the documentation? Is it possible to simply include a link in the Readme that links with the documentation markdown files?
Make SAM and BAM records display similarly
A SAM and BAM record contain basically the same information. So they should be displayed similarly in the REPL.
Create SAM-to-BAM conversion functions
This is a little harder, but probably worthwhile.
Today I faced a problem on installing XAM.jl:
(v1.6) pkg> add XAM
Resolving package versions...
ERROR: Unsatisfiable requirements detected for package BioSequences [7e6ae17a]:
BioSequences [7e6ae17a] log:
├─possible versions are: 1.0.0-2.0.5 or uninstalled
├─restricted to versions * by an explicit requirement, leaving only versions 1.0.0-2.0.5
├─restricted by compatibility requirements with XAM [d759349c] to versions: 2.0.0-2.0.5
│ └─XAM [d759349c] log:
│ ├─possible versions are: 0.1.0-0.2.7 or uninstalled
│ └─restricted to versions * by an explicit requirement, leaving only versions 0.1.0-0.2.7
└─restricted by compatibility requirements with Bio [3637df68] to versions: 1.0.0-1.1.0 — no versions left
└─Bio [3637df68] log:
├─possible versions are: 1.0.0-1.0.1 or uninstalled
└─restricted to versions * by an explicit requirement, leaving only versions 1.0.0-1.0.1
It is strange since I have already installed BioSequences.jl:
(v1.6) pkg> status BioSequences
Status `/data/projects/Project_JIE/programs/CONFIG_FILES_AT_HOME_DIR/dot_julia/environments/v1.6/Project.toml`
[7e6ae17a] BioSequences v1.1.0
I'm on a Linux server. The Julia info is:
julia> versioninfo()
Julia Version 1.6.1
Commit d5c1cfd* (2021-05-06 13:40 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Xeon(R) CPU E7-4880 v2 @ 2.50GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, ivybridge)
Does anyone know the reason? Thanks.
I'm trying to read a BAM file and add only mapped records to an array. But irrespective of the BAM file used the return records array is filled with the same record (the record changes when I change the file but with same file its all filled same record).
function read_bam(bam_path)
rec = []
reader = open(BAM.Reader, bam_path)
record = BAM.Record()
while !eof(reader)
empty!(record)
read!(reader, record)
if BAM.ismapped(record)
push!(rec, record)
end
end
return rec
end
Am I doing something wrong here? Is there a way to achieve this?
I tried to count the number of BAM.Record objects overlapped with a specified genomic region by using eachoverlap(bam_reader, chrom, leftpos:rightpos)
julia> using GenomicFeatures
julia> using XAM
julia> using BioAlignments
julia> function test_bam_eachoverlap(path_bam, chrom, leftpos, rightpos)
bam_reader = open(BAM.Reader, path_bam, index = path_bam * ".bai")
count = 0
for record in eachoverlap(bam_reader, chrom, leftpos:rightpos)
count += 1
end
close(bam_reader)
return count
end
This BAM file (SRR7993829_1.100K.forward.bam
) has one record for JH584304.1:
$ samtools view SRR7993829_1.100K.forward.bam JH584304.1
SRR7993829.4835529.1 272 JH584304.1 51715 0 15M559N60M * 0 0 CAAGCAGAACCTTTGGTCATGCCAGCGACAAGCCAACAGCCTGCGTAGGTGGCTGGAGCTAAGCCAGGAGGAGCG JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFAA AS:i:-5 ZS:i:-5 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:74C0 YT:Z:UU XS:A:+ NH:i:2
So, the expected behavior is as follows:
julia> test_bam_eachoverlap("SRR7993829_1.100K.forward.bam", "JH584304.1", 51000, 51200)
0
julia> test_bam_eachoverlap("SRR7993829_1.100K.forward.bam", "JH584304.1", 51000, 51715)
1
I got an BoundsError:
julia> test_bam_eachoverlap("SRR7993829_1.100K.forward.bam", "JH584304.1", 51000, 51200)
0
julia> test_bam_eachoverlap("SRR7993829_1.100K.forward.bam", "JH584304.1", 51000, 51715)
ERROR: BoundsError: attempt to access 1-element Array{BGZFStreams.Block,1} at index [2]
Stacktrace:
[1] getindex at ./array.jl:744 [inlined]
[2] virtualoffset(::BGZFStreams.BGZFStream{IOStream}) at /Users/haruka/.julia/packages/BGZFStreams/qApsr/src/bgzfstream.jl:156
[3] iterate(::XAM.BAM.OverlapIterator{IOStream}, ::XAM.BAM.OverlapIteratorState) at /Users/haruka/.julia/packages/XAM/9nk3g/src/bam/overlap.jl:65
[4] test_bam_eachoverlap(::String, ::String, ::Int64, ::Int64) at ./REPL[83]:5
[5] top-level scope at REPL[85]:1
julia> test_bam_eachoverlap("SRR7993829_1.100K.forward.bam", "JH584304.1", 51715, 51716)
ERROR: BoundsError: attempt to access 1-element Array{BGZFStreams.Block,1} at index [2]
Stacktrace:
[1] getindex at ./array.jl:744 [inlined]
[2] virtualoffset(::BGZFStreams.BGZFStream{IOStream}) at /Users/haruka/.julia/packages/BGZFStreams/qApsr/src/bgzfstream.jl:156
[3] iterate(::XAM.BAM.OverlapIterator{IOStream}, ::XAM.BAM.OverlapIteratorState) at /Users/haruka/.julia/packages/XAM/9nk3g/src/bam/overlap.jl:65
[4] test_bam_eachoverlap(::String, ::String, ::Int64, ::Int64) at ./REPL[83]:5
[5] top-level scope at REPL[86]:1
I'm not sure how to fix the error, but I think that the error is relevant to the fact that the last record in this BAM file (SRR7993829_1.100K.forward.bam
) is on JH584304.1:
$ samtools idxstats hisat2/SRR7993829_1.100K.forward.bam
chr1 195471971 1565 0
chr2 182113224 2116 0
chr3 160039680 1599 0
chr4 156508116 1789 0
chr5 151834684 1699 0
chr6 149736546 1486 0
chr7 145441459 1626 0
chr8 129401213 1393 0
chr9 124595110 1659 0
chr10 130694993 1230 0
chr11 122082543 2123 0
chr12 120129022 1580 0
chr13 120421639 1018 0
chr14 124902244 937 0
chr15 104043685 913 0
chr16 98207768 7585 0
chr17 94987271 2645 0
chr18 90702639 946 0
chr19 61431566 783 0
chrX 171031299 921 0
chrY 91744698 51 0
chrM 16299 730 0
GL456210.1 169725 0 0
GL456211.1 241735 0 0
GL456212.1 153618 0 0
GL456213.1 39340 0 0
GL456216.1 66673 0 0
GL456219.1 175968 0 0
GL456221.1 206961 0 0
GL456233.1 336933 0 0
GL456239.1 40056 0 0
GL456350.1 227966 2 0
GL456354.1 195993 0 0
GL456359.1 22974 0 0
GL456360.1 31704 0 0
GL456366.1 47073 0 0
GL456367.1 42057 0 0
GL456368.1 20208 0 0
GL456370.1 26764 0 0
GL456372.1 28664 0 0
GL456378.1 31602 0 0
GL456379.1 72385 0 0
GL456381.1 25871 0 0
GL456382.1 23158 0 0
GL456383.1 38659 0 0
GL456385.1 35240 0 0
GL456387.1 24685 0 0
GL456389.1 28772 0 0
GL456390.1 24668 0 0
GL456392.1 23629 0 0
GL456393.1 55711 0 0
GL456394.1 24323 0 0
GL456396.1 21240 0 0
JH584292.1 14945 0 0
JH584293.1 207968 1 0
JH584294.1 191905 6 0
JH584295.1 1976 0 0
JH584296.1 199368 0 0
JH584297.1 205776 0 0
JH584298.1 184189 1 0
JH584299.1 953012 0 0
JH584300.1 182347 0 0
JH584301.1 259875 0 0
JH584302.1 155838 0 0
JH584303.1 158099 0 0
JH584304.1 114452 1 0
ERCC-00002 1061 0 0
ERCC-00003 1023 0 0
ERCC-00004 523 0 0
ERCC-00009 984 0 0
ERCC-00012 994 0 0
ERCC-00013 808 0 0
ERCC-00014 1957 0 0
ERCC-00016 844 0 0
ERCC-00017 1136 0 0
ERCC-00019 644 0 0
ERCC-00022 751 0 0
ERCC-00024 536 0 0
ERCC-00025 1994 0 0
ERCC-00028 1130 0 0
ERCC-00031 1138 0 0
ERCC-00033 2022 0 0
ERCC-00034 1019 0 0
ERCC-00035 1130 0 0
ERCC-00039 740 0 0
ERCC-00040 744 0 0
ERCC-00041 1122 0 0
ERCC-00042 1023 0 0
ERCC-00043 1023 0 0
ERCC-00044 1156 0 0
ERCC-00046 522 0 0
ERCC-00048 992 0 0
ERCC-00051 274 0 0
ERCC-00053 1023 0 0
ERCC-00054 274 0 0
ERCC-00057 1021 0 0
ERCC-00058 1136 0 0
ERCC-00059 525 0 0
ERCC-00060 523 0 0
ERCC-00061 1136 0 0
ERCC-00062 1023 0 0
ERCC-00067 644 0 0
ERCC-00069 1137 0 0
ERCC-00071 642 0 0
ERCC-00073 603 0 0
ERCC-00074 522 0 0
ERCC-00075 1023 0 0
ERCC-00076 642 0 0
ERCC-00077 273 0 0
ERCC-00078 993 0 0
ERCC-00079 644 0 0
ERCC-00081 534 0 0
ERCC-00083 1022 0 0
ERCC-00084 994 0 0
ERCC-00085 844 0 0
ERCC-00086 1020 0 0
ERCC-00092 1124 0 0
ERCC-00095 521 0 0
ERCC-00096 1107 0 0
ERCC-00097 523 0 0
ERCC-00098 1143 0 0
ERCC-00099 1350 0 0
ERCC-00104 2022 0 0
ERCC-00108 1022 0 0
ERCC-00109 536 0 0
ERCC-00111 994 0 0
ERCC-00112 1136 0 0
ERCC-00113 840 0 0
ERCC-00116 1991 0 0
ERCC-00117 1136 0 0
ERCC-00120 536 0 0
ERCC-00123 1022 0 0
ERCC-00126 1118 0 0
ERCC-00130 1059 0 0
ERCC-00131 771 0 0
ERCC-00134 274 0 0
ERCC-00136 1033 0 0
ERCC-00137 537 0 0
ERCC-00138 1024 0 0
ERCC-00142 493 0 0
ERCC-00143 784 0 0
ERCC-00144 538 0 0
ERCC-00145 1042 0 0
ERCC-00147 1023 0 0
ERCC-00148 494 0 0
ERCC-00150 743 0 0
ERCC-00154 537 0 0
ERCC-00156 494 0 0
ERCC-00157 1019 0 0
ERCC-00158 1027 0 0
ERCC-00160 743 0 0
ERCC-00162 523 0 0
ERCC-00163 543 0 0
ERCC-00164 1022 0 0
ERCC-00165 872 0 0
ERCC-00168 1024 0 0
ERCC-00170 1023 0 0
ERCC-00171 505 0 0
* 0 0 0
using GenomicFeatures
using XAM
using BioAlignments
function test_bam_eachoverlap(path_bam, chrom, leftpos, rightpos)
bam_reader = open(BAM.Reader, path_bam, index = path_bam * ".bai")
count = 0
for record in eachoverlap(bam_reader, chrom, leftpos:rightpos)
count += 1
end
close(bam_reader)
return count
end
test_bam_eachoverlap("SRR7993829_1.100K.forward.bam", "JH584304.1", 51000, 51715)
I posted this some time ago and someone mentioned that the performance of BAM.quality
wasn't optimal. I also see in some of my code that it takes a significant portion of time.
Is there a reason for that reinterpret and conversion business ?
https://github.com/BioJulia/BioAlignments.jl/blob/master/src/bam/record.jl#L548
I've tried the following and it seems to work fine:
function quality(record)
BAM.checkfilled(record)
seqlen = BAM.seqlength(record)
offset = BAM.seqname_length(record) + BAM.n_cigar_op(record, false) * 4 + BAM.cld(seqlen, 2)
return record.data[(1+offset):(seqlen+offset)]
end
Cut down the time from 232 to 65 ns. Returning a view cuts it by further by 2. Maybe there's some cases where that fails ? I saw the bam specifications says qualities are char and not uint8.
It seems that the optional fields aren't correctly parsed which in turn leads to an error raised by SAM.auxdata(record).
I find that some keys returned by keys(record) are incorrect.
For example, AS:i:0 returns :i as the key.
Make seqname
an alias of refname
to make seqname
consistent with the rest of the BioJulia ecosystem - namely GenomicFeatures.
Also, make BioGenerics.groupname
an alias of refname
. The groupname
method will be used in GenomicFeatures
v3.
XAM 0.2.6 doesn't work with Automa 0.8.1 (and because it's not pinned to 0.8.0, that version is installed by default)
Precompilation fails. Trace:
julia> using XAM
[ Info: Precompiling XAM [d759349c-bcba-11e9-07c2-5b90f8f05f7c]
ERROR: LoadError: LoadError: LoadError: Ambiguous DFA: Input 0x21 can lead to actions [:header, :mark, :pos] or [:header]
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:33
[2] validate_paths(paths::Vector{Tuple{Union{Nothing, Automa.Edge}, Vector{Symbol}}})
@ Automa ~/.julia/packages/Automa/vXuxy/src/dfa.jl:179
[3] validate_nfanodes
@ ~/.julia/packages/Automa/vXuxy/src/dfa.jl:198 [inlined]
[4] nfa2dfa(nfa::Automa.NFA, unambiguous::Bool)
@ Automa ~/.julia/packages/Automa/vXuxy/src/dfa.jl:103
[5] compile(re::Automa.RegExp.RE; optimize::Bool, unambiguous::Bool)
@ Automa ~/.julia/packages/Automa/vXuxy/src/machine.jl:55
[6] compile
@ ~/.julia/packages/Automa/vXuxy/src/machine.jl:55 [inlined]
[7] map
@ ./tuple.jl:215 [inlined]
[8] map (repeats 2 times)
@ ./tuple.jl:216 [inlined]
[9] (::XAM.SAM.var"#28#29")()
@ XAM.SAM ~/.julia/packages/XAM/9nk3g/src/sam/readrecord.jl:146
[10] top-level scope
@ ~/.julia/packages/XAM/9nk3g/src/sam/readrecord.jl:11
[11] include(mod::Module, _path::String)
@ Base ./Base.jl:386
[12] include(x::String)
@ XAM.SAM ~/.julia/packages/XAM/9nk3g/src/sam/sam.jl:4
[13] top-level scope
@ ~/.julia/packages/XAM/9nk3g/src/sam/sam.jl:69
[14] include(mod::Module, _path::String)
@ Base ./Base.jl:386
[15] include(x::String)
@ XAM ~/.julia/packages/XAM/9nk3g/src/XAM.jl:1
[16] top-level scope
@ ~/.julia/packages/XAM/9nk3g/src/XAM.jl:7
[17] include
@ ./Base.jl:386 [inlined]
[18] include_package_for_output(pkg::Base.PkgId, input::String, depot_path::Vector{String}, dl_load_path::Vector{String}, load_path::Vector{String}, concrete_deps::Vector{Pair{Base.PkgId, UInt64}}, source::Nothing)
@ Base ./loading.jl:1213
[19] top-level scope
@ none:1
[20] eval
@ ./boot.jl:360 [inlined]
[21] eval(x::Expr)
@ Base.MainInclude ./client.jl:446
[22] top-level scope
@ none:1
in expression starting at /home/dialvarezs/.julia/packages/XAM/9nk3g/src/sam/readrecord.jl:11
in expression starting at /home/dialvarezs/.julia/packages/XAM/9nk3g/src/sam/sam.jl:4
in expression starting at /home/dialvarezs/.julia/packages/XAM/9nk3g/src/XAM.jl:1
ERROR: Failed to precompile XAM [d759349c-bcba-11e9-07c2-5b90f8f05f7c] to /home/dialvarezs/.julia/compiled/v1.6/XAM/jl_nOmVNU.
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:33
[2] compilecache(pkg::Base.PkgId, path::String, internal_stderr::Base.TTY, internal_stdout::Base.TTY)
@ Base ./loading.jl:1360
[3] compilecache(pkg::Base.PkgId, path::String)
@ Base ./loading.jl:1306
[4] _require(pkg::Base.PkgId)
@ Base ./loading.jl:1021
[5] require(uuidkey::Base.PkgId)
@ Base ./loading.jl:914
[6] require(into::Module, mod::Symbol)
@ Base ./loading.jl:901
I'm trying to use the bioalignments sam reader to read the alignments in a file created by Minimap2. This file looks like a 'normal' sam file, but I still get the following error:
BioAlignments.SAM.Reader file format error on line 1536 ~>"\t*\n"
Line 1536 is the first line in the file that contains an alignment.
I am using the following code and the error occurs on line 4:
reader = SAM.Reader(open(samfile,"r"))
record = SAM.Record()
while !eof(reader)
read!(reader, record)
end
I'd like to use eachoverlap()
as the example in the documentation.
However, when the region is not available in a BAM file (e.g., special chromosome names like patches in human reference genome), I got an error (see below for details):
ERROR: MethodError: no method matching overlapchunks(::Indexes.BGZFIndex, ::Nothing, ::UnitRange{Int64})
Closest candidates are:
overlapchunks(::Indexes.BGZFIndex, ::Integer, ::UnitRange) at /Users/ozakiharuka/.julia/packages/Indexes/fj340/src/bgzfindex.jl:41
I think one way to handle such error is to check whether a BAM file contains the region before eachoverlap()
is called.
Does anybody know how to do this?
Test data (.bam, .bam.bai) is here
julia> using XAM
julia> using GenomicFeatures
julia> function print_overlapped_record(bam_file, chrom, leftpos, rightpos)
open(BAM.Reader, bam_file, index = bam_file * ".bai") do reader
for record in eachoverlap(reader, chrom, leftpos:rightpos)
print(record)
end
end
end
print_overlapped_record (generic function with 1 method)
julia> testfun(bam_file, "chr1", 3205900, 3616344)
XAM.BAM.Record:
template name: NS500723:19:H5LLWBGXX:2:13108:16341:13952
flag: 16
reference ID: 1
position: 3279555
mapping quality: 255
CIGAR: 75M
next reference ID: 0
next position: 0
template length: 0
sequence: ACCAAAGGTCCACCTTGCCTTCACCTTATTTGTCCTAAATTTCTTGCACACCGATATCAAAGGCCAGTCTGCAGC
base quality: UInt8[0x24, 0x24, 0x0e, 0x20, 0x24, 0x20, 0x24, 0x0e, 0x0e, 0x0e, 0x20, 0x24, 0x0e, 0x0e, 0x0e, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x0e, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x0e, 0x24, 0x0e, 0x0e, 0x24, 0x24, 0x20, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x20, 0x24, 0x24, 0x24, 0x24, 0x0e, 0x24, 0x24, 0x20, 0x15, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x20, 0x24, 0x24, 0x15, 0x24, 0x24, 0x20, 0x24, 0x24, 0x20, 0x20, 0x20, 0x20, 0x15]
auxiliary data: AS=-3 XN=0 XM=1 XO=0 XG=0 NM=1 MD=8C66 YT=UU NH=1XAM.BAM.Record:
template name: NS500723:19:H5LLWBGXX:2:12211:9394:16712
flag: 16
reference ID: 1
position: 3532296
mapping quality: 255
CIGAR: 52M
next reference ID: 0
next position: 0
template length: 0
sequence: TTTATCAAGTGTGGGTCGGAGCCAAGCAGCACGAACTGCAGGTGAACTCCAT
base quality: UInt8[0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x20, 0x20, 0x20, 0x20, 0x20]
auxiliary data: AS=-5 XN=0 XM=1 XO=0 XG=0 NM=1 MD=4A47 YT=UU NH=1XAM.BAM.Record:
template name: NS500723:19:H5LLWBGXX:3:13405:17228:11793
flag: 0
reference ID: 1
position: 3532529
mapping quality: 255
CIGAR: 75M
next reference ID: 0
next position: 0
template length: 0
sequence: GGGAACACGTCACCAGTCACCTACAACATGTTCGGACACTTCAAGTTCTGCATCACTCTCTGCGGAGGATACATT
base quality: UInt8[0x20, 0x20, 0x20, 0x20, 0x20, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24]
auxiliary data: AS=-5 XN=0 XM=1 XO=0 XG=0 NM=1 MD=7T67 YT=UU NH=1XAM.BAM.Record:
template name: NS500723:19:H5LLWBGXX:4:11504:6970:1796
flag: 16
reference ID: 1
position: 3532535
mapping quality: 255
CIGAR: 75M
next reference ID: 0
next position: 0
template length: 0
sequence: ACGTCACCAGTCACCTACAACATGTTCGGACACTTCAAGTTCTGCATCACTCTCTGCGGAGGATACATTCTATTT
base quality: UInt8[0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x20, 0x20, 0x20, 0x20, 0x20]
auxiliary data: AS=-5 ZS=-10 XN=0 XM=1 XO=0 XG=0 NM=1 MD=1T73 YT=UU NH=1XAM.BAM.Record:
template name: NS500723:19:H5LLWBGXX:1:13307:15032:13241
flag: 16
reference ID: 1
position: 3532548
mapping quality: 255
CIGAR: 39M
next reference ID: 0
next position: 0
template length: 0
sequence: CCTACAACATGTTCGGACACTTCAAGTTCTGCATCACTC
base quality: UInt8[0x0e, 0x24, 0x20, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x0e, 0x24, 0x24, 0x0e, 0x15, 0x24, 0x15, 0x24, 0x24, 0x20, 0x20, 0x20, 0x0e, 0x24, 0x20, 0x24, 0x24, 0x20, 0x24, 0x24, 0x24, 0x20, 0x20, 0x20, 0x20, 0x20]
auxiliary data: AS=0 ZS=-5 XN=0 XM=0 XO=0 XG=0 NM=0 MD=39 YT=UU NH=1
julia> testfun(bam_file, "chrZ", 3205900, 3516344)
ERROR: MethodError: no method matching overlapchunks(::Indexes.BGZFIndex, ::Nothing, ::UnitRange{Int64})
Closest candidates are:
overlapchunks(::Indexes.BGZFIndex, ::Integer, ::UnitRange) at /Users/ozakiharuka/.julia/packages/Indexes/fj340/src/bgzfindex.jl:41
Stacktrace:
[1] iterate(::XAM.BAM.OverlapIterator{IOStream}) at /Users/ozakiharuka/.julia/packages/XAM/ahh4D/src/bam/overlap.jl:54
[2] (::var"#9#10"{String,Int64,Int64})(::XAM.BAM.Reader{IOStream}) at ./REPL[25]:4
[3] #open#2(::Base.Iterators.Pairs{Symbol,String,Tuple{Symbol},NamedTuple{(:index,),Tuple{String}}}, ::typeof(open), ::var"#9#10"{String,Int64,Int64}, ::Type{XAM.BAM.Reader}, ::String) at /Users/ozakiharuka/.julia/packages/BioGenerics/cCuGr/src/IO.jl:48
[4] #open at ./none:0 [inlined]
[5] testfun(::String, ::String, ::Int64, ::Int64) at ./REPL[25]:3
[6] top-level scope at REPL[41]:1
When I ran a previous code I faced an error, since BAM.flag()
function no longer existed. I found these was other undocumented function XAM.flags()
. It seems XAM.flags()
behaves identically to the previous BAM.flag()
function after a try. Is that XAM.flags()
is just a rename of BAM.flag()
in the current and future versions, or is there another "formal" way I should use to access the flags data of read mapping information? Thank you!
When looping trough a BAM file with eachoverlap I get an error if the selected interval intersects past the last read of the BAM file.
samtools view -b chr1:1-10000 bigBAM.bam > small.bam
samtools index small.bam
reader = open(BAM.Reader, "small.bam", index="small.bam.bai")
for record in eachoverlap(reader, "chr1", 1:1000000)
# ...
# ...
end
close(reader)
Loops correctly trough all the reads of the small bam file.
The loop manages to read the last BAM record but then fails with a:
Stacktrace:
[1] getindex at ./array.jl:728 [inlined]
[2] virtualoffset(::BGZFStreams.BGZFStream{IOStream}) at /nfs/users/nfs_r/ra11/.julia/packages/BGZFStreams/qApsr/src/bgzfstream.jl:156
[3] iterate(::BioAlignments.BAM.OverlapIterator{IOStream}, ::BioAlignments.BAM.OverlapIteratorState) at /nfs/users/nfs_r/ra11/.julia/packages/BioAlignments/FOisL/src/bam/overlap.jl:65
[4] top-level scope at ./REPL[4]:2
Same as stated above. Take a section of a BAM and run
I don't know if this is a significant bottleneck, but I noticed some type instability when looping over BAM file records:
function read_bam(bam_file)
# iterate over BAM records
chr = String[]
left_pos = Int64[]
right_pos = Int64[]
for record in reader
push!(chr, BAM.refname(record))
push!(left_pos, leftposition(record))
push!(right_pos, rightposition(record))
end
DataFrame(chr = chr, left = left_pos, right = right_pos)
end
@code_warntype read_bam(some_file)
!julia> @code_warntype read_bam(bam_file)
Variables:
#self#
bam_file
record::ANY
#temp#::ANY
chr::Array{String,1}
left_pos::Array{Int64,1}
right_pos::Array{Int64,1}
itemT
Body:
begin
chr::Array{String,1} = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Array{String,1}, svec(Any, Int64), Array{String,1}, 0, 0, 0)) # line 4:
left_pos::Array{Int64,1} = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Array{Int64,1}, svec(Any, Int64), Array{Int64,1}, 0, 0, 0)) # line 5:
right_pos::Array{Int64,1} = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Array{Int64,1}, svec(Any, Int64), Array{Int64,1}, 0, 0, 0)) # line 6:
SSAValue(0) = Main.reader
#temp#::ANY = (Base.start)(SSAValue(0))::ANY
9:
unless !((Base.done)(SSAValue(0), #temp#::ANY)::ANY)::ANY goto 39
SSAValue(1) = (Base.next)(SSAValue(0), #temp#::ANY)::ANY
record::ANY = (Core.getfield)(SSAValue(1), 1)::ANY
#temp#::ANY = (Core.getfield)(SSAValue(1), 2)::ANY # line 7:
SSAValue(5) = ($(QuoteNode(BioAlignments.BAM.refname)))(record::ANY)::String
$(Expr(:inbounds, false))
# meta: location array.jl push! 652
SSAValue(6) = (Base.bitcast)(UInt64, (Base.check_top_bit)(1)::Int64)
$(Expr(:foreigncall, :(:jl_array_grow_end), Void, svec(Any, UInt64), :(chr), 0, SSAValue(6), 0)) # line 653:
# meta: location abstractarray.jl endof 134
# meta: location abstractarray.jl linearindices 99
# meta: location abstractarray.jl indices1 71
# meta: location abstractarray.jl indices 64
SSAValue(9) = (Base.arraysize)(chr::Array{String,1}, 1)::Int64
# meta: pop location
# meta: pop location
# meta: pop location
# meta: pop location
(Base.arrayset)(chr::Array{String,1}, SSAValue(5), (Base.select_value)((Base.slt_int)(SSAValue(9), 0)::Bool, 0, SSAValue(9))::Int64)::Array{String,1}
# meta: pop location
$(Expr(:inbounds, :pop)) # line 8:
(Main.push!)(left_pos::Array{Int64,1}, (Main.leftposition)(record::ANY)::ANY)::Array{Int64,1} # line 9:
(Main.push!)(right_pos::Array{Int64,1}, (Main.rightposition)(record::ANY)::ANY)::Array{Int64,1}
37:
goto 9
39: # line 11:
return $(Expr(:invoke, MethodInstance for (::Core.#kw#Type)(::Array{Any,1}, ::Type{DataFrames.DataFrame}), :($(QuoteNode(Type))), :(
end::DataFrames.DataFrame
I think the ANY type for record and temp indicate some type instability
Hi,
Thank you for building and maintaining this great package --- I wouldn't start analyzing genomic data via Julia without it!
I have a steps that I used to perform using pysam, but could not figure how they can be done via XAM:
Would you be able to guide me how this can be done?
Sincerely,
Daniel
Previously seen on Slack
It would be great if there was some way to convert an alignment into a SAM
/BAM.Record
so that it could be exported to the filesystem for future use. Note that I will only refer to SAM.Record
from here on, but the same would be expected for BAM.Record
, along the lines of #25.
Ideally, this would simply be a new method for the SAM.Record
constructor. Something like
XAM.SAM.Record(aln::PairwiseAlignment, kwargs...)
There would need to be other information (possibly with defaults?) added, as a SAM.Record
contains more fields than
There is no such behavior.
The challenge here is that SAM.Record
s store more information than PairwiseAlignment
s. Here's a map of what I think could be inferred vs. what needs to be passed as an argument.
Column | Field name | PairwiseAlignment accessor |
Possible default |
---|---|---|---|
1 | QNAME | N/A | N/A |
2 | FLAG | N/A* | N/A |
3 | RNAME | N/A | N/A |
4 | POS | N/A | 1** |
5 | MAPQ | unknown | 255 |
6 | CIGAR | cigar(aln.a.aln) | N/A |
7 | RNEXT | N/A | N/A*** |
8 | PNEXT | N/A | N/A*** |
9 | TLEN | length(aln.a.seq) | N/A |
10 | SEQ | aln.a.seq | N/A |
11 | QUAL | N/A | "*" |
12 | NM**** | count_mismatches(aln) | N/A |
It might be possible to compute some of the flags based on the sequence, but I don't know how to do so.
In my experience, BioAlignments.cigar
generates an alignment that always starts at position 1, and then pads it with deletions.
These fields are context-sensitive. Maybe generating records should be stateful 🤷?
Technically, the Number of Mismatches field is optional, but it is included in the recommended practices, and some tools fail if it isn't included.
I was working on generating very specific alignments for variant calling purposes, and found that I couldn't export the result to another tool. Having this functionality would remedy that.
Dear @CiaranOMara
I haven't lost track of the existing issues, and will get around to resolving them!
However, I got sidetracked with a more fundamental issue: The BGZFStreams package is slow, and not well integrated with the rest of the Julia IO ecosystem. Kenta Sato, the author of BGZFStreams.jl realized this, and created CodecBGZF.jl three years ago, but it looks like that project has been abandoned, and Kenta is a busy guy, hard to reach these days. See also this comment from Kenta Sato
In fact, BGZFStreams is the major bottleneck of XAM.jl, to the extent that optimizing XAM does not make any practical difference because it is so bogged down by BGZFStreams. For that reason, I created CodecBZFG myself. It's faster, safer, and more generic than BGZFStreams. I'd expect XAM's BAM module to be around 2x faster with CodecBGZF, perhaps more.
It'll be a bit of a project, since XAM depends on Indexes, which also depends on BGZFStreams.
I'll make this issue here, and then hopefully get around to it the following few months. If you support this change, and feel like giving it a crack, please be my guest! It'll also be useful to have other eyes on CodecBGZF to make sure I didn't bork the interface.
First, thanks for this package! I made some functions for creating objects from my GenomicVectors.jl package using your BAM reader functions. I think I found the right functions.
Chromosome names and lengths:
function GenomeInfo(name, reader::BioAlignments.BAM.Reader)
GenomeInfo(name,
reader.refseqnames,
reader.refseqlens
)
end
Chromosome name and alignment start and end positions:
read_bam = function(bam_file)
# iterate over BAM records
chr = String[]
left_pos = Int64[]
right_pos = Int64[]
for record in reader
push!(chr, BAM.refname(record))
push!(left_pos, leftposition(record))
push!(right_pos, rightposition(record))
end
DataFrame(chr = chr, left = left_pos, right = right_pos)
end
Should/could reader.refseqnames, reader.refseqlens and BAM.refname be exported? I'd be happy to add the exports, some docstrings and a bit of README text if you are open to it.
can you add some new functions to the whole BAM class? for examples,
This issue is used to trigger TagBot; feel free to unsubscribe.
If you haven't already, you should update your TagBot.yml
to include issue comment triggers.
Please see this post on Discourse for instructions and more details.
If you'd like for me to do this for you, comment TagBot fix
on this issue.
I'll open a PR within a few hours, please be patient!
Sometimes,We want to iter bam according to the position not record. So do you have the plan to add the function.
I've been working a bit with BAM files using BioAlignments the past days and have been looking through the source code. During the work, there was a few function I found that I missed from BioAlignments which would have my work easier. They would all be quite easy to implement, and I'd love to pitch a PR with the new interfaces + tests, but being interfaces, they're hard to change once released and very annoying to get wrong. So, let's discuss them first!
Currently, an AuxData
object can be created from the auxiliary data of a BAM record. In the current implementation It necessitates a copy of the data from the record, meaning that the user can manipulate the AuxData and seemingly succeeding, but without the record actually changing. Furthermore, most interfaces (all except iterate
, AFAIK) for aux fields is currently implemented to work on Vector{UInt8}
or on the records directly, making the AuxData type mostly obsolete.
I suggest removing the object alltogether to avoid code duplication of functions acting on AuxData and Records and confusion about what the user is mutating. If this is not acceptable, then making the AuxData view-based.
get(record::BAM.Record, key::AbstractString, default)
Records work sort of like AbstractDict
. In fact, the implemented AuxData
type is a subtype of AbstractDict. But asking for a value with a default is very awkward for the user. I suggest adding this method, which is to completely mirror how it works for Dicts.
delete!(record::BAM.Record, key::AbstractString)
Actually changing the auxiliary fields of BAM.Record objects is difficult. Again, this one should simply mirror the similar method for Dict
s. One could also think about adding pop!
.
setindex!(record::BAM.Record, key::AbstractString, value)
This would also be useful to have. The value would be encoded and appended to the end of the data array. Probably a few different methods would have to be written, since the type of Value should only be an UInt8, Int32, Float32, String, or Vector{T} where T can be UInt8, Int8, UInt16, Int16, UInt32, Int32 or Float32.
People can make all kinds of custom headers, but usually they just want the @SQ
headers updated plus whatever non-SQ headers the program created. It's worth noting that BioAlignments make no checks for the header and records matching together, gladly writing a healthy-looking but unparsable (by htslib) BAM.file, so making it easy to not mess up the headers is rather important. I suggest making it easy to update the SQ headers. I'd like any suggestion here.
I can imagine two new constructors for SAM.Header, namely SAM.Header(h::Sam.Header, records::Vector{BAM.Record})
and SAM.Header(records::Vector{BAM.Record})
, the latter creating a header with the appropriate "SQ" headers, the former with the same, but keeping any non-SQ tags already present.
Hi,
Could someone update Automa dependency ? XAM is restricted to 0.7, 0.8 meanwhile latest is 1.2.1. It's complicated to explain, but some packages that fix some bugs in latest version require latest Automa and this is conflicting with XAM requirements which are a bit out dated.
I'm trying to convert SAM record to a BAM one. I think I got the data layout roughly correct but the cigar and sequences need to be encoded differently or something. For context I now have paired read alignement working in BurrowsWheelerAligner.jl, so it would be possible to do a FASTQ to BAM & results directly in Julia. In general it would be nice to have more support for creating BAM alignment from scratch, modify them, etc. (https://discourse.julialang.org/t/change-record-sequence-in-a-bam-file-using-xam/98141/2) and I think converting SAM to BAM is a good starting point.
Any help would be appreciated.
sam = "seq1\t81\tPhiX\t1051\t60\t70M\t=\t1821\t702\tTCTTGGCTTCCTTGCTGGTCAGATTGGTCGTCTTATTACCATTTCAACTACTCCGGTTATCGCTGGCGAC\t*\tNM:i:0\tMD:Z:70\tMC:Z:70M\tAS:i:70\tXS:i:0"
r = XAM.SAM.Record(sam)
cigar_ops = ('M','I','D','N','S','H','P','=','X')
br = BAM.Record()
# fixed-length fields (see BMA specs for the details)
br.refid = 1 #TODO : get from header
br.pos = SAM.position(r)
br.l_read_name = SAM.tempname(r) |> x -> length(x)+1 #why do I need + 1 ?
br.mapq = SAM.mappingquality(r)
#br.bin
br.n_cigar_op = sum(x ∈ cigar_ops for x in SAM.cigar(r))
br.flag = SAM.flag(r)
br.l_seq = SAM.seqlength(r)
br.next_refid = 1 #TODO : get from header
br.next_pos = SAM.nextposition(r)
br.tlen = SAM.templength(r)
# variable length data
t = x -> unsafe_wrap(Vector{UInt8}, x)
br.data = vcat(
SAM.tempname(r) |> t,
SAM.cigar(r) |> t,
SAM.sequence(String, r) |> t,
fill(0x41, length(SAM.sequence(r))),
#SAM.quality(r),
reduce(vcat, r.data[f] for f in r.fields) # aux data
)
br.block_size = length(br.data)
br
XAM.BAM.Record:
template name: seq1
flag: 81
reference ID: 2
position: 1052
mapping quality: 60
CIGAR: 70599891M
next reference ID: 2
next position: 1822
template length: 702
sequence: RGRGGVGVGMRGRGGMGMRGRGGVGMRGGVGVRGGMGAGVGARGRGGVGVRGGMGVRGGMRGRGGARGRG
base quality: UInt8[0x41, 0x43, 0x43, 0x41, 0x54, 0x54, 0x54, 0x43, 0x41, 0x41 … 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41]
auxiliary data: AA=A AA=A AA=A AA=A AA=A AA=A AA=A AA=AERROR: invalid type tag: 'M'
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:35
[2] auxtype
@ ~/.julia/packages/XAM/l6LT2/src/bam/auxdata.jl:63 [inlined]
[3] loadauxtype
@ ~/.julia/packages/XAM/l6LT2/src/bam/auxdata.jl:82 [inlined]
Hi @CiaranOMara, I encountered this issue as well and I put together a BAM file that has two reads with the same position (one is the other's unmapped mate) for reproducing the issue. I find that the eof check fix (e.g. with feature/CodecBGZF-issue-31 branch) will actually skip one of the reads. See https://gist.github.com/Marlin-Na/9cb7767dfc87b007500a6fff4a8f5b9a
Originally posted by @Marlin-Na in #31 (comment)
I think my issue was that if a read is unmapped, its
rightposition(record)
will becomeposition(record) - 1
, which is >smaller than the leftposition. I think the check here https://github.com/BioJulia/XAM.jl/blob/develop/src/bam/overlap.jl#L87 >should account for this special case. Probably change it to the following? It worked for me.if rid < interval[1] || (rid == interval[1] && rightposition(record) < first(interval[2]) && position(record) < first(interval[2]))
and
if rid > interval[1] || (rid == interval[1] && position(record) > last(interval[2]) && rightposition(record) > last(interval[2]))
Originally posted by @Marlin-Na in #31 (comment)
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.