Hi Wouter,
It looks like you ran into one of the older sanity checks I had in place during early development. As the panic indicates, during the WFA global-realignment process it iterated to an edit distance greater than 100 kb between the mapping and the reference graph. The tool is designed with assumptions based on PacBio HiFi datasets (length and accuracy) and the corresponding pre-processing (pbmm2, DeepVariant, etc.). An edit distance that large would indicate something is not matching our expectations for the input data. I can't say for sure whether that is a true high edit distance for the mapping, or if HiPhase has entered into some unexpected state due to the input data.
On the HiPhase side, we are not planning to remove this sanity check, but I will make a note to massage it into a cleaner error message instead of a difficult-to-interpret panic.
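For readers unfamiliar with this kind of guard, here is a minimal sketch of a wavefront-style edit-distance solver that gives up once a maximum edit distance is exceeded. It is an illustration only, not HiPhase's implementation: it aligns two plain strings rather than a read against a reference graph, and the names `wfa_edit_distance` and `MaxEditDistanceReached` are hypothetical.

```python
class MaxEditDistanceReached(Exception):
    """Hypothetical error type raised when the solver exceeds its threshold."""


def wfa_edit_distance(pattern, text, max_edit_distance):
    """Global edit distance via furthest-reaching wavefronts (unit costs).

    Diagonal k = h - v, where h indexes `text` and v indexes `pattern`.
    wavefront[k] stores the furthest h reached on diagonal k at the current score.
    """
    n, m = len(pattern), len(text)
    k_final = m - n  # diagonal on which both sequences end together

    def extend(k, h):
        # Follow exact matches along diagonal k at no cost.
        v = h - k
        while h < m and v < n and pattern[v] == text[h]:
            h += 1
            v += 1
        return h

    wavefront = {0: extend(0, 0)}
    score = 0
    while wavefront.get(k_final, -1) < m:
        score += 1
        if score > max_edit_distance:
            # Sanity check: stop instead of iterating indefinitely.
            raise MaxEditDistanceReached(
                f"Max_edit_distance ({max_edit_distance}) reached during WFA solving"
            )
        next_wavefront = {}
        for k in range(-score, score + 1):
            candidates = []
            if k - 1 in wavefront:
                candidates.append(wavefront[k - 1] + 1)  # insertion: consume text only
            if k in wavefront:
                candidates.append(wavefront[k] + 1)      # mismatch: consume both
            if k + 1 in wavefront:
                candidates.append(wavefront[k + 1])      # deletion: consume pattern only
            # Keep only offsets that stay inside both sequences.
            valid = [h for h in candidates if 0 <= h <= m and 0 <= h - k <= n]
            if valid:
                next_wavefront[k] = extend(k, max(valid))
        wavefront = next_wavefront
    return score


print(wfa_edit_distance("kitten", "sitting", max_edit_distance=10))  # 3
```

The work done grows with the edit distance, so a threshold like this bounds runtime and memory when the input is far noisier than expected.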
Matt
from hiphase.
Hi Matt,
Thanks for the explanation! Are you realigning individual reads or something else? If possible, could you edit the error to report the read or location that triggered this? Happy to take a closer look. For the record, this is R10.4.1 duplex data, so not necessarily all that different from HiFi, I guess. Could it be an unexpectedly long read?
Wouter
> Are you realigning individual reads or something else?
Yes, each mapping gets re-aligned to an ALT-aware localized reference during allele assignment. There are more details on the process in the pre-print and supplement: https://www.biorxiv.org/content/10.1101/2023.05.03.539241v1
> If possible, could you edit the error to report the read or location that triggered this?
That's likely doable, but I need to look at what information is readily available when the check occurs before saying yes or no.
> Could it be an unexpectedly long read?
Possibly? But it seems odd to me: my understanding of duplex is that most reads should be in the low 100s of kb, so hitting a 100 kb edit distance seems really high given that length (i.e., the empirical Q-score would be quite low relative to reported Q-scores). Either way, a long read with > 100 kb edit distance would definitely break some assumptions.
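To make the Q-score argument concrete, here is a small sketch of the arithmetic, assuming the usual Phred convention Q = -10 * log10(error rate) with the error rate approximated as edit distance divided by aligned length. The function name `empirical_q` and the 200 kb read length are illustrative assumptions, not values from the data in this thread.

```python
import math


def empirical_q(edit_distance, aligned_length):
    """Approximate empirical Phred Q-score from an edit distance and a length."""
    if aligned_length <= 0:
        raise ValueError("aligned_length must be positive")
    if edit_distance <= 0:
        return math.inf  # no observed errors
    error_rate = edit_distance / aligned_length
    return -10 * math.log10(error_rate)


# Hypothetical 200 kb read that hits the 100 kb edit-distance threshold.
print(round(empirical_q(100_000, 200_000), 2))  # 3.01
```

Under these assumptions, a read in the low 100s of kb that reaches the threshold would have an empirical Q in the low single digits.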
Matt
HiPhase v0.10.2 should provide a cleaner error message with the read name and position in the output, similar to this one where we lowered the threshold for testing:
...
[2023-06-26T20:04:54.252Z ERROR hiphase] Error while processing PhaseBlock { block_index: 18, coordinates: "chr1:2236919-2944365", num_variants: 1005, sample_name: "HG001" }:
[2023-06-26T20:04:54.252Z ERROR hiphase] Encountered WFA error for mapping "m64109_200805_204709/161875013/ccs" (chr1:2648676): Max_edit_distance (4000) reached during WFA solving
If you don't mind, can you report back on what you find is causing it? We want to verify this is not something we should be expecting in edge cases for HiFi datasets under our supported upstream tooling.
Hi Matt,
I got:
Encountered WFA error for mapping "88705c6b-329f-46fe-8921-9455ced2eb96" (chr21:6211371): Max_edit_distance (100000) reached during WFA solving
That is a locus with quite some segdups. Here are the offending alignments:
offending-read.sam.txt
Quite a long read, with a 173kb alignment in a nasty region... does this look unexpected?
Wouter
Yeah, this is definitely not something we're currently accounting for (nor are we expecting this from current HiFi tech/processing). Ignoring the read length, the CIGAR string has a high number of D/I entries ({'S': 24171, 'M': 142393, 'D': 23494, 'I': 7040} are the totals I see), and that's not accounting for mismatches, which would make the edit distance even higher. Also comforting to see that it's happening in a noisy region, and not somewhere that is relatively clean.
Thanks for reporting back, it will be good to be aware of this issue if we decide to support additional data types deviating from the current HiFi expectations.
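For anyone who wants to reproduce per-operation totals like the ones quoted above, a minimal sketch follows. It assumes a standard SAM CIGAR string; the helper names `cigar_totals` and `indel_bases` are hypothetical, and the short CIGAR in the example is made up for illustration.

```python
import re
from collections import Counter

# One CIGAR element: a length followed by a SAM operation code.
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_totals(cigar):
    """Sum the lengths of each CIGAR operation, e.g. {'M': 14, 'D': 2}."""
    totals = Counter()
    for length, op in CIGAR_ELEMENT.findall(cigar):
        totals[op] += int(length)
    return dict(totals)


def indel_bases(totals):
    """Bases in deletions plus insertions; mismatches inside M are not counted."""
    return totals.get("D", 0) + totals.get("I", 0)


print(cigar_totals("5S10M2D3I4M"))  # {'S': 5, 'M': 14, 'D': 2, 'I': 3}

# Totals reported in the thread for the offending read.
reported = {"S": 24171, "M": 142393, "D": 23494, "I": 7040}
print(indel_bases(reported))  # 30534
```

The indel bases in that alignment alone come to roughly 30.5 kb before any mismatches are counted. How soft-clipped bases contribute during the realignment is not stated in this thread.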