Comments (3)
Making one PR for each would be overwhelming, especially since fixing some of them has implications that are not straightforward.
Let's split this up then.
Hotfix PRs addressing errors take priority. Hotfix PRs would be against the master branch and would embrace the current design flaws. Also, PRs addressing inconsistencies could be hotfixes too provided they don't alter the existing public API.
My preference is to address all known errors before making breaking changes (v0.3).
We should just get rid of unfilled records, they make little sense as a concept. Instead, we initialise every record as an empty, but valid record. This way, we never need to check if the record is filled, since it always is.
It'd be great to get rid of that isfilled
check. The balance in question here is whether to use try catch
, some other truthy toggle, or a use pattern that ensures the record is valid or that there was a conscious decision to use the record regardless.
The direction I was thinking is a two-stage approach, which would facilitate error recovery (https://github.com/BioJulia/XAM.jl/projects/4#card-42156212). The first stage would fill the record by copying in the current line. Then the second stage would attempt to index the copied data then return a truthy value.
With this approach, the file buffers and pointers will remain intact to process the next line if encountering an error or inconsistency while indexing the copied data. The state machine will greatly simplify as it would separate into three machines, the header-line, record-line and line structure.
I'm not sure what the performance implication would be with the two-stage approach as it parses the line twice? I'd be keen to hear your ideas about error recovery.
Rewrite the memory layout of the BAM record so it exactly matches the specifications.
I agree with rewriting the memory layout -- that's been bugging me. I also agree with your proposed way of consolidating the accessor and has functions. I think we should use BioGenerics.Exceptions.MissingFieldException
.
PRs with developmental or breaking changes would be against the develop branch.
from xam.jl.
My preference is to address all known errors before making breaking changes (v0.3).
Right, makes sense. I'll begin with that
The balance in question here is whether to use try catch, some other truthy toggle, or a use pattern that ensures the record is valid or that there was a conscious decision to use the record regardless.
I think we should enforce specification-compliant BAM/SAM files, in the sense that, if the user provides a noncompliant file, we ought to error. Silent errors are the worst, especially in a scientific setting. I'm not sure there's a lot of sense in allowing bad records to be used.
In particular, I have rather strong opinions about #23 - this is an error in HiSat2, and should be fixed there. If we begin correcting errors created by bad writers, we have a neverending problem on our hands.
Nonetheless, having the internal implementation be first decompressing the data, then copying a single record to a Vector{UInt8}
buffer, then using that buffer to construct a valid BAM Record
, is a good idea. I think that's also how it works right now, isn't it? If a user wants to use an invalid record (again, which I don't think is something we should design for allowing), they may add a new method that creates a Record from the vector. The performance will be fine.
I'm quite ambivalent about how to handle missing data. I think there are two good options:
- Return
nothing
- Raise a
BioGenerics.Exceptions.MissingFieldException
.
I think returning amissing
value is NOT a good idea, becausemissing
is largely useless when processing SAM/BAM files, and may lead to unexpected errors. Probably a MissingFieldException is the way to go.
from xam.jl.
I think we should enforce specification-compliant BAM/SAM files, in the sense that, if the user provides a noncompliant file, we ought to error.
I mostly agree. I think as a default we should expect specification-compliant BAM/SAM files and error when there is non-compliance. But, I think we only need to enforce this expectation at the line-level, not holistically, so that when it comes to iteration, users can make a judgement call and opt to skip errors.
Silent errors are the worst, especially in a scientific setting.
I agree we should error.
I'm not sure there's a lot of sense in allowing bad records to be used.
Yes, a poor choice of words on my part. The record would probably be unusable as well as indicative of other issues relating to the quality of the file.
What I was thinking is that the data should be accessible for diagnosis and easy to acquire. For example, after an error, we should be able to seek position, then read erroneous data into Record
or MetaInfo
.
In particular, I have rather strong opinions about #23 - this is an error in HiSat2, and should be fixed there.
I agree with that too and have no intention of merging the workaround. For me, the issue was that XAM in its current state, was not able to work with or around an error that was judged trivial.
Nonetheless, having the internal implementation be first decompressing the data, then copying a single record to a Vector{UInt8} buffer, then using that buffer to construct a valid BAM Record, is a good idea. I think that's also how it works right now, isn't it?
In the current release, and certainly with SAM, indexing of the data occurs as the line/record is determined. In a sense, the record already has the indexes before the raw data gets copied into it. Also, we do not explicitly check successful indexing; we assume indexing was successful if the record fills without error. I'd need to look at BAM again to see what it does with records, but the parsing of the BAM header is the same as SAM.
In the context of skipping errors, was proposing to have both index-then-fill and fill-then-index schemes. Though perhaps a cleaner approach would be to have Automa seek the end of the line/record if it falls into an invalid state and then returns a flag for error handling whether that be a throw or skip. With this approach, the pointer appropriately positions for the next line/record, and there is only a single pass over the data.
I'm quite ambivalent about how to handle missing data.
I'd forgotten about Missing
. When we last spoke about it, I thought that missing
could provide an additional statistic as well as neaten the accessors.
from xam.jl.
Related Issues (20)
- BoundsError in eachoverlap() HOT 14
- Browsing the documentation HOT 1
- Broken with Automa 0.8.1 HOT 5
- TagBot trigger issue HOT 5
- can you add some functions to BAM.Reader? HOT 2
- Position of unmapped mate
- Install ERROR: Unsatisfiable requirements detected for package BioSequences HOT 2
- Make `seqname` consistent with BioJulia ecosystem
- how to use @threads to use multithreads when reading BAM file HOT 2
- Feature Request: Conversion from `BioAlignments` to `SAM.Record`
- Finding a map between observed bases and positions in the reference genome HOT 4
- eachoverlap running past end of stream
- Add `index!(reader::Reader, path)`
- Do you have the plan to add the pileup functions? HOT 8
- Switch to Oneflow HOT 3
- Correct way to add records to an array HOT 1
- Add eachoverlap(reader::Reader, refname::String)
- while loop got eof error HOT 1
- [WIP] convert SAM to BAM HOT 1
- XAM compatibilty for Automa very old, please update?
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from xam.jl.