
Comments (6)

holtjma commented on May 25, 2024

Hi Wouter,

It looks like you ran into one of the older sanity checks I had in place during early development. As the panic indicates, during the WFA global-realignment process it iterated to an edit distance greater than 100 kb between the mapping and the reference graph. The tool is designed with assumptions based on PacBio HiFi datasets (length and accuracy) and the corresponding pre-processing (pbmm2, DeepVariant, etc.). An edit distance that large would indicate something is not matching our expectations for the input data. I can't say for sure whether that is a true high edit distance for the mapping, or if HiPhase has entered into some unexpected state due to the input data.
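To illustrate what a capped edit-distance check looks like in general, here is a minimal sketch. It is not HiPhase's code and it is not the WFA algorithm: it uses a plain row-by-row Levenshtein dynamic program, and the function name and the `max_edit_distance` parameter are made up for this example. The only idea it shares with the check described above is giving up once the distance is certain to exceed a cap.

```python
from typing import Optional


def bounded_edit_distance(query: str, target: str, max_edit_distance: int) -> Optional[int]:
    """Return the Levenshtein distance, or None if it exceeds the cap.

    Hypothetical helper for illustration only (plain DP, not WFA).
    """
    previous = list(range(len(target) + 1))
    for i, q in enumerate(query, start=1):
        current = [i]
        for j, t in enumerate(target, start=1):
            cost = 0 if q == t else 1
            current.append(
                min(
                    previous[j] + 1,         # base present only in the query
                    current[j - 1] + 1,      # base present only in the target
                    previous[j - 1] + cost,  # match or mismatch
                )
            )
        # Every alignment path crosses this row, so the row minimum is a
        # lower bound on the final distance: abort early once it passes the cap.
        if min(current) > max_edit_distance:
            return None
        previous = current
    result = previous[-1]
    return result if result <= max_edit_distance else None
```

With a cap of 5, `bounded_edit_distance("kitten", "sitting", 5)` returns 3; with a cap of 2 the same call returns `None`, which is the situation a tool would report as an error.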

On the HiPhase side, we are not planning to remove this sanity check, but I will make a note to massage it into a cleaner error message instead of a difficult-to-interpret panic.

Matt

from hiphase.

wdecoster commented on May 25, 2024

Hi Matt,

Thanks for the explanation! Are you realigning individual reads or something else? If possible, could you edit the error to report the read or location that triggered this? Happy to take a closer look. For the record, this is R10.4.1 duplex data, so not necessarily all that different from HiFi, I guess. Could it be an unexpectedly long read?

Wouter


holtjma commented on May 25, 2024

> Are you realigning individual reads or something else?

Yes, each mapping gets re-aligned to an ALT-aware localized reference during allele assignment. There are more details on the process in the pre-print and supplement: https://www.biorxiv.org/content/10.1101/2023.05.03.539241v1

> If possible, could you edit the error to report the read or location that triggered this?

That's likely doable, but I need to look at what information is readily available when the check occurs before saying yes or no.

> Could it be an unexpectedly long read?

Possibly? But it seems odd to me: my understanding of duplex is that most reads should be in the low 100s of kb, so hitting a 100 kb edit distance seems really high given that length (i.e., the empirical Q-score would be quite low relative to reported Q-scores). Either way, a long read with > 100 kb edit distance would definitely break some assumptions.
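The Q-score argument can be made concrete with a small calculation. This sketch assumes the usual Phred convention, Q = -10 * log10(error rate), and treats the error rate as edit distance divided by aligned length. The 200 kb read length in the usage note is an arbitrary illustrative value, not a measurement from this dataset, and the function name is hypothetical.

```python
import math


def empirical_q(edit_distance: int, aligned_length: int) -> float:
    """Phred-scaled empirical quality implied by an alignment's edit distance."""
    if aligned_length <= 0:
        raise ValueError("aligned_length must be positive")
    if edit_distance == 0:
        return math.inf  # no observed errors
    error_rate = edit_distance / aligned_length
    return -10.0 * math.log10(error_rate)
```

For a hypothetical 200 kb read, `empirical_q(100_000, 200_000)` is about 3.0, since half of the bases would be edits. That is far below the Q20+ range typically reported for high-accuracy long reads, which is why an edit distance this large looks suspicious.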

Matt


holtjma commented on May 25, 2024

@wdecoster

HiPhase v0.10.2 should provide a cleaner error message with the read name and position in the output, similar to this one where we lowered the threshold for testing:

...
[2023-06-26T20:04:54.252Z ERROR hiphase] Error while processing PhaseBlock { block_index: 18, coordinates: "chr1:2236919-2944365", num_variants: 1005, sample_name: "HG001" }:
[2023-06-26T20:04:54.252Z ERROR hiphase]   Encountered WFA error for mapping "m64109_200805_204709/161875013/ccs" (chr1:2648676): Max_edit_distance (4000) reached during WFA solving
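For anyone who wants to pull the read name and position out of such a message automatically, here is a small sketch. The regular expression is written against the message text shown in this thread only; it assumes the wording stays the same across versions, which is not guaranteed, and the function name is made up for this example.

```python
import re
from typing import Optional

# Pattern derived from the error text quoted in this thread (an assumption,
# not a documented format).
WFA_ERROR = re.compile(
    r'Encountered WFA error for mapping "(?P<read>[^"]+)" '
    r"\((?P<chrom>[^:]+):(?P<pos>\d+)\): "
    r"Max_edit_distance \((?P<cap>\d+)\)"
)


def parse_wfa_error(line: str) -> Optional[dict]:
    """Extract read name, chromosome, position and cap from one log line."""
    match = WFA_ERROR.search(line)
    if match is None:
        return None
    return {
        "read": match.group("read"),
        "chrom": match.group("chrom"),
        "pos": int(match.group("pos")),
        "cap": int(match.group("cap")),
    }
```

The returned read name and coordinates can then be used to look up the offending alignment in the input BAM.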

If you don't mind, can you report back on what you find is causing it? We want to verify this is not something we should be expecting in edge cases for HiFi datasets under our supported upstream tooling.


wdecoster commented on May 25, 2024

Hi Matt,

I got:
Encountered WFA error for mapping "88705c6b-329f-46fe-8921-9455ced2eb96" (chr21:6211371): Max_edit_distance (100000) reached during WFA solving
That is a locus with quite some segdups. Here are the offending alignments:
offending-read.sam.txt

Quite a long read, with a 173kb alignment in a nasty region... does this look unexpected?

Wouter


holtjma commented on May 25, 2024

Yeah, this is definitely not something we're currently accounting for (nor are we expecting this from current HiFi tech/processing). Ignoring the read length, the CIGAR string has a high number of D/I entries ({'S': 24171, 'M': 142393, 'D': 23494, 'I': 7040} are the totals I see), and that's not accounting for mismatches, which would make the edit distance even higher. It's also comforting to see that it's happening in a noisy region, and not somewhere that is relatively clean.
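Per-operation totals like the ones quoted above can be tallied from a CIGAR string with a few lines of code. This is a minimal sketch with hypothetical function names; it assumes a standard SAM CIGAR string and does not reproduce how the totals in this thread were actually computed.

```python
import re
from collections import Counter
from typing import Mapping

# SAM CIGAR operations: a run length followed by one operation character.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_totals(cigar: str) -> Counter:
    """Sum the run lengths of each CIGAR operation."""
    totals: Counter = Counter()
    for length, op in CIGAR_OP.findall(cigar):
        totals[op] += int(length)
    return totals


def indel_lower_bound(totals: Mapping[str, int]) -> int:
    """Inserted plus deleted bases: a lower bound on the edit distance,
    because mismatches hidden inside M runs are not counted."""
    return totals.get("I", 0) + totals.get("D", 0)
```

Applied to the totals reported above, `indel_lower_bound({'S': 24171, 'M': 142393, 'D': 23494, 'I': 7040})` gives 30534 bases from indels alone, before any mismatches. The M, I, and S totals also sum to 173,604 query bases, close to the 173 kb figure mentioned earlier in the thread.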

Thanks for reporting back, it will be good to be aware of this issue if we decide to support additional data types deviating from the current HiFi expectations.

