Dear <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url

Various spec violations about xam.jl HOT 3 OPEN

biojulia commented on June 7, 2024

Various spec violations

from xam.jl.

Comments (3)

CiaranOMara commented on June 7, 2024

Making one PR for each would be overwhelming, especially since fixing some of them has implications that are not straightforward.

Let's split this up then.

Hotfix PRs addressing errors take priority. Hotfix PRs would be against the master branch and would embrace the current design flaws. Also, PRs addressing inconsistencies could be hotfixes too provided they don't alter the existing public API.

My preference is to address all known errors before making breaking changes (v0.3).

We should just get rid of unfilled records, they make little sense as a concept. Instead, we initialise every record as an empty, but valid record. This way, we never need to check if the record is filled, since it always is.

It'd be great to get rid of that isfilled check. The balance in question here is whether to use try catch, some other truthy toggle, or a use pattern that ensures the record is valid or that there was a conscious decision to use the record regardless.

The direction I was thinking is a two-stage approach, which would facilitate error recovery (https://github.com/BioJulia/XAM.jl/projects/4#card-42156212). The first stage would fill the record by copying in the current line. Then the second stage would attempt to index the copied data then return a truthy value.
With this approach, the file buffers and pointers will remain intact to process the next line if encountering an error or inconsistency while indexing the copied data. The state machine will greatly simplify as it would separate into three machines, the header-line, record-line and line structure.

I'm not sure what the performance implication would be with the two-stage approach as it parses the line twice? I'd be keen to hear your ideas about error recovery.

Rewrite the memory layout of the BAM record so it exactly matches the specifications.

I agree with rewriting the memory layout -- that's been bugging me. I also agree with your proposed way of consolidating the accessor and has functions. I think we should use BioGenerics.Exceptions.MissingFieldException.

PRs with developmental or breaking changes would be against the develop branch.

from xam.jl.

jakobnissen commented on June 7, 2024

My preference is to address all known errors before making breaking changes (v0.3).

Right, makes sense. I'll begin with that

The balance in question here is whether to use try catch, some other truthy toggle, or a use pattern that ensures the record is valid or that there was a conscious decision to use the record regardless.

I think we should enforce specification-compliant BAM/SAM files, in the sense that, if the user provides a noncompliant file, we ought to error. Silent errors are the worst, especially in a scientific setting. I'm not sure there's a lot of sense in allowing bad records to be used.

In particular, I have rather strong opinions about #23 - this is an error in HiSat2, and should be fixed there. If we begin correcting errors created by bad writers, we have a neverending problem on our hands.

Nonetheless, having the internal implementation be first decompressing the data, then copying a single record to a Vector{UInt8} buffer, then using that buffer to construct a valid BAM Record, is a good idea. I think that's also how it works right now, isn't it? If a user wants to use an invalid record (again, which I don't think is something we should design for allowing), they may add a new method that creates a Record from the vector. The performance will be fine.

I'm quite ambivalent about how to handle missing data. I think there are two good options:

Return nothing
Raise a BioGenerics.Exceptions.MissingFieldException.
I think returning a missing value is NOT a good idea, because missing is largely useless when processing SAM/BAM files, and may lead to unexpected errors. Probably a MissingFieldException is the way to go.

from xam.jl.

CiaranOMara commented on June 7, 2024

I think we should enforce specification-compliant BAM/SAM files, in the sense that, if the user provides a noncompliant file, we ought to error.

I mostly agree. I think as a default we should expect specification-compliant BAM/SAM files and error when there is non-compliance. But, I think we only need to enforce this expectation at the line-level, not holistically, so that when it comes to iteration, users can make a judgement call and opt to skip errors.

Silent errors are the worst, especially in a scientific setting.

I agree we should error.

I'm not sure there's a lot of sense in allowing bad records to be used.

Yes, a poor choice of words on my part. The record would probably be unusable as well as indicative of other issues relating to the quality of the file.

What I was thinking is that the data should be accessible for diagnosis and easy to acquire. For example, after an error, we should be able to seek position, then read erroneous data into Record or MetaInfo.

In particular, I have rather strong opinions about #23 - this is an error in HiSat2, and should be fixed there.

I agree with that too and have no intention of merging the workaround. For me, the issue was that XAM in its current state, was not able to work with or around an error that was judged trivial.

Nonetheless, having the internal implementation be first decompressing the data, then copying a single record to a Vector{UInt8} buffer, then using that buffer to construct a valid BAM Record, is a good idea. I think that's also how it works right now, isn't it?

In the current release, and certainly with SAM, indexing of the data occurs as the line/record is determined. In a sense, the record already has the indexes before the raw data gets copied into it. Also, we do not explicitly check successful indexing; we assume indexing was successful if the record fills without error. I'd need to look at BAM again to see what it does with records, but the parsing of the BAM header is the same as SAM.

In the context of skipping errors, was proposing to have both index-then-fill and fill-then-index schemes. Though perhaps a cleaner approach would be to have Automa seek the end of the line/record if it falls into an invalid state and then returns a flag for error handling whether that be a throw or skip. With this approach, the pointer appropriately positions for the next line/record, and there is only a single pass over the data.

I'm quite ambivalent about how to handle missing data.

I'd forgotten about Missing. When we last spoke about it, I thought that missing could provide an additional statistic as well as neaten the accessors.

from xam.jl.

Various spec violations about xam.jl HOT 3 OPEN

Comments (3)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent