a-ludi / djunctor

Close assembly gaps using long reads with a focus on correctness.

License: MIT License
It is possible that a gap is reduced too much, i.e. the resulting contigs would overlap if correctly aligned. This has to be avoided at all costs because it makes downstream analysis much harder.

This is a follow-up to issue #25.
There is a known bug in damapper when it internally calls LAmerge: the latter uses an incorrect sort order, so broken alignment chains may result. Details are still missing.
Currently, if requested, a Dazzler mask describing the assessed repeat structure is written under a given name. Unfortunately, it is always placed in the working directory. There should be a way to designate a target location (relative to pwd).
From issue #35:
Above it says:
[…] By identifying those regions of the read that map to more than one locus we can derive some repetitive regions on the reference. […] Secondly, should the involved portion of the read map to any other location on the reference then regions covered by that portion should be marked as repetitive as well. […]
Realizing repeat detection by this technique is technically complex and might be replaced by a simpler approach: since all the involved alignments, or sub-regions thereof, map to the reference, these regions of the reference should also map to themselves. Thus, by adjusting the error tolerance (the -e switch of the Dazzler tools), we should be able to observe the same regions.
The OverlayDB should be "cleaned" every once in a while to free up disk space.
The repeat mask derived from the self-alignment seems to be a strict subset of the mask derived from the read alignment. Thus, we may skip the self-alignment, saving computation resources and reducing the complexity of the algorithm.
We can use reads that reach into gaps without spanning them to improve the sequence of reads that span the same gap, i.e. for a read R_a extending contig C_x and a read R_b extending contigs C_y and C_z (y != z), add R_a to the pile up of R_b iff x = y or x = z.
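This rule can be sketched in Python (djunctor itself is not Python; the data layout and names here are invented for illustration):

```python
def assign_extending_reads(extending_reads, gap_pileups):
    """For each read R_a that extends a contig C_x without spanning a
    gap, add it to the pile up of every gap flanked by C_x, i.e. every
    gap (y, z) with x == y or x == z.

    extending_reads: dict read id -> id of the contig the read extends
    gap_pileups:     dict (y, z) -> list of ids of reads spanning that gap
    """
    for read_id, x in extending_reads.items():
        for (y, z), pile_up in gap_pileups.items():
            if x == y or x == z:  # the read reaches into this gap
                pile_up.append(read_id)
    return gap_pileups


pile_ups = {(1, 2): ["span_1"]}
extenders = {"ext_a": 1, "ext_b": 3}  # ext_b touches no flank of gap (1, 2)
assign_extending_reads(extenders, pile_ups)
print(pile_ups)  # {(1, 2): ['span_1', 'ext_a']}
```

Such extension reads cannot close the gap on their own, but they add coverage to the pile up's consensus near the gap flanks.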
At the moment, similarScore
just has some arbitrary thresholds for absolute and relative 'error'. It would be nice to have a theoretical foundation for this function:
What will the consensus of two equally represented haplotypes look like?
Input:
tttccgccttacacaacacagtgttctgctattgcccagagggcactttt
tttgcatcaaactggaaacctagaaaaaatgggtgatttcctagtaattt
Consensus Result:
tttccg-ccttacacaacacagtgttctgctattgcccag-agggcactt
|||*||*|||||||||||||||||||*||||*|||||*||*|||||||||
tttgcgtccttacacaacacagtgtt-tgct-ttgcctagtagggcactt
tttgcatcaaactggaaacctagaaaaaatgggtgattt-cctagta----a-ttt
||||||||||*||*||*||**|*|*|*|*||**||*|||*||*||*|****|*|||
tttgcatcaa-cttgacac--a-acacagtgtttgctttgcccag-agggcacttt
Write unit tests for each module.
Issue by a-ludi
Thursday Mar 15, 2018 at 09:02 GMT
Originally opened as https://github.com/a-ludi/ddjunctor/issues/2
Maybe one can use machine learning to sort the reads into the three buckets?
Candidate Algorithms:
In order to compute the consensus of a pile up an alignment of the cropped reads must be computed. Ideally, we would want to use the information on the (relative) location of the cropped reads we gained during prior analysis and selection.
- lasfilteralignments to reduce faulty alignments.

This is an idea for a near-natural testing scenario with known "ground truth". Its main benefits are: (1) near-natural challenges and (2) knowledge about the correctness of the result.
It takes these steps to generate the test data:

1. Take two versions vX and vY, X < Y, of some genome.
2. Map vY to vX.
3. … vY and take vX to be the ground truth.

Assess the algorithm's result by comparing it to vY.

djunctor should be tested with batches of randomized input data in order to discover malfunctions that, by chance, do not occur in tests with static data.
TODO define details
In order to apply the algorithm multiple times, the user needs to provide intermediate results as input, one of which is the repeat mask. The user input shall be used as the initial value of the repeat mask, which then gets updated by the new alignments.
Until now, gaps can successfully be filled. Now it is time to transfer the procedure to the extension case:

- extend CoordinateTransform so extensions can be added
- adapt DJunctor.getInsertionInfo to the extension case
- add DJunctor.getExtensionSequenceSlice analogous to DJunctor.getGapSequenceSlice
- DJunctor.fillGap
- add DJunctor.extendContig analogous to DJunctor.fillGap
Currently, there is no explicit filter stage that discards improper alignments; instead, there are two stages: one ignoring them and the other implicitly removing them. These reads should be explicitly filtered out before engaging in the other stages.
In order to thoroughly test the function of djunctor, we need an automatic integration test suite. The suite should process some toy examples with known outcomes. The tests should cover different scenarios, from fully artificial and non-random to realistically simulated.
Previously, repeats were detected by analysing individual alignments – reference vs. reference and reference vs. reads (e.g. issues #35 and #36). This can lead to problems with tricky data, as individual alignments may be faulty or may result from unexpected data. Consider these two scenarios:
The solution to this problem is to make coverage-based decisions, ie. if for some locus the alignment coverage gets out of some boundaries that locus should be considered "bad". If the coverage is much higher than expected the locus likely is repetitive; if the coverage is much lower than expected either the reads or the reference have bad quality.
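The coverage-based decision can be sketched as follows (the thresholds and names are invented for illustration; they are not djunctor's actual parameters):

```python
def classify_loci(coverage, expected, low=0.25, high=2.0):
    """Label each locus by its alignment coverage relative to the
    expected coverage: far above -> likely repetitive, far below ->
    likely bad quality (reads or reference), otherwise unremarkable."""
    labels = []
    for locus_coverage in coverage:
        if locus_coverage > high * expected:
            labels.append("repeat")
        elif locus_coverage < low * expected:
            labels.append("low-quality")
        else:
            labels.append("ok")
    return labels


print(classify_loci([50, 10, 1], expected=10))  # ['repeat', 'ok', 'low-quality']
```

Unlike per-alignment heuristics, this aggregates evidence over all alignments covering a locus, so a single faulty alignment cannot flip the decision.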
For both the ref vs. ref and the ref vs. reads alignments do:
Since the loop shall be removed from the main algorithm, the result can be written sequentially to stdout at the end. Thus, OverlayDB only adds unnecessary complexity and shall be removed.

- stdout
- CoordinateTransform
- OverlayDB
Write unit tests for:
It is important to account for low-complexity regions as they produce ambiguous and unreliable alignments.

Candidate tool: DBdust
The two criteria in the original issue #3 address two different concerns. Thus, they should be separated into two functions:

- filterRedundantReads: the aligned read, i.e. the alignment extended with the exceeding read sequence on either end, is fully contained in a single contig.
- filterAmbiguouslyMappingReads: the read aligns to multiple locations of one contig with approximately equal quality.

In order to have a clean implementation, some generalized filtering pipeline should be devised.
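One possible shape for such a generalized pipeline, sketched in Python rather than the project's own language (the predicates and read fields are invented for illustration):

```python
def filter_pipeline(reads, *filters):
    """Apply each filter in turn; a filter is a predicate returning
    True for reads to keep.  Also record which filter discarded which
    read, which helps with debugging and reporting."""
    kept, discarded = list(reads), {}
    for keep in filters:
        remaining = []
        for read in kept:
            if keep(read):
                remaining.append(read)
            else:
                discarded[read["id"]] = keep.__name__
        kept = remaining
    return kept, discarded


def filter_redundant_reads(read):
    # discard reads fully contained in a single contig
    return not read["contained"]


def filter_ambiguously_mapping_reads(read):
    # discard reads with several equally good mapping locations
    return read["good_locations"] == 1


reads = [
    {"id": 1, "contained": False, "good_locations": 1},
    {"id": 2, "contained": True, "good_locations": 1},
    {"id": 3, "contained": False, "good_locations": 2},
]
kept, discarded = filter_pipeline(
    reads, filter_redundant_reads, filter_ambiguously_mapping_reads)
print([r["id"] for r in kept])  # [1]
```

Each concern stays in its own small predicate, and new filters (e.g. the weak-anchor filter from issue #25) can be appended without touching the pipeline itself.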
When acquiring the trace points for alignment chains, a so-called fingerprint of the alignment chain is built, which is used to identify it. Apparently, the fingerprint of the alignment chain in inverse order (ref2reads vs. reads2ref) sometimes differs slightly from the fingerprint in original order, e.g.:
{
"originalOrderFingerprint": [
6, // Contig A ID
1824, // Contig B ID
8115, // Begin on contig A of first local alignment in chain
0, // Begin on contig B of first local alignment in chain
12750, // End on contig A of last local alignment in chain
4645 // End on contig B of last local alignment in chain
],
"inverseOrderFingerprint": [
6, // Contig A ID
1824, // Contig B ID
8178, // Begin on contig A of first local alignment in chain
0, // Begin on contig B of first local alignment in chain
12750, // End on contig A of last local alignment in chain
4645 // End on contig B of last local alignment in chain
]
}
Is this a bug in LAdump or rather an invalid assumption in djunctor?

Before building the consensus, each read of a pile up should be cropped such that only the part reaching into the gap is considered. Read sequence outside of this region adds only noise to the pairwise alignment of the reads.
In order to have a "clean" cutting point, a trace point of the alignment common to all reads in the pile up should be used. The first/last such trace point should be used for a gap begin/end, thus maximizing the common alignment.
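A sketch of picking such common cutting points (assuming each read's alignment contributes contig positions at multiples of the trace spacing, so exact intersection is meaningful; the function name is illustrative):

```python
def common_cutting_points(trace_points_per_read):
    """Intersect the trace point positions (on the contig) of all reads
    in a pile up and return the first and last common one.  Cropping at
    the first/last common trace point keeps the alignment region shared
    by all reads as large as possible."""
    common = set(trace_points_per_read[0])
    for points in trace_points_per_read[1:]:
        common &= set(points)
    if not common:
        return None  # no clean cutting point exists
    return min(common), max(common)


print(common_cutting_points([
    [0, 100, 200, 300],
    [100, 200, 300, 400],
    [100, 300],
]))  # (100, 300)
```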
Applying this algorithm iteratively can be done from outside if given some intermediate results. Thus, the loop should be eliminated and access to the intermediate results provided.
// Initialization
maskBadRegions()
filterReadAlignments()

// Close Gaps
buildPileUps()
for each pileUp do
    selectGoodReads()
    cropAndValidateReads()
    buildConsensus()

// Finish
pruneOverfilledGaps()
generateOutput()
Until now, the workdir needs to be set at module scope in order for the module to function. This imposes limitations when it comes to switching the workdir.

Pass the workdir explicitly with every function call. Thus, it will be easy to use a different workdir for each call – if desired.
Currently, the CLI option --min-anchor-length controls how long an "anchor" must be at least to be considered unique. A sound value for this option should be derivable from the estimated reference and reads error rates, given some confidence level (see also issues #26, #28 and #32).
If you want to update an Illumina assembly (or similar), how much coverage should your new PacBio (or similar) reads have?
There should be an option to specify the expected error rate, which is then used to calculate the -e option for daligner or damapper.
When building pile ups only "valid" read alignments are considered. Sadly, a read spanning a whole contig and extending it is not "valid" as it has an ambiguous type (extension, spanning, …). At the same time, these reads are very valuable and must not be discarded.
insertionScore is going to include a penalty for improper alignments, and the split alignment will be improper.

Alternatively, the type of an alignment could be a combination of none, extension or span. This solution has some implementation overhead, as it requires adjustments in several places of the code, but avoids the drawbacks of the first solution.

Currently, an alignment/read is considered useless if:
Both criteria are far too harsh. Thus, I propose the following criteria:
It is important to account for repetitive regions as they produce ambiguous and unreliable alignments.
getComplementaryOrder was originally built so one can fetch trace points in the "complementary order". For that, only the beginning and end of the whole chain were needed. But now this function is being used to generate the working data and should be adjusted accordingly.
d'accord is a non-hybrid long-read consensus program based on local de Bruijn graph assembly. We use this tool to build a consensus out of a stack of alignments spanning/extending an assembly gap. The resulting sequence can then be used to close the gap at hand.
Using the -C option of damapper produces two LAS files, [B-reads].[A-reads].las and [A-reads].[B-reads].las. Currently, both files are used for different purposes, and a mapping between the alignment chains of both must be made (see djunctor.djunctor.AlignmentContainer). This seems to generate unnecessary complexity, thus opening doors for errors.

Applying this algorithm iteratively can be done from outside if given some intermediate results. Thus, the loop should be eliminated. The loop is currently limited to one iteration anyway.
// Initialization
maskBadRegions()
filterReadAlignments()

// Close Gaps
buildPileUps()
for each pileUp do
    selectGoodReads()
    cropAndValidateReads()
    buildConsensus()

// Finish
pruneOverfilledGaps()
generateOutput()
Reads will not be considered if they map to different locations on the reference with similar quality. This should take into account that sometimes only short regions of a read map to a different location (but with the same quality). This might be caused by short transposable elements (TE). Theoretically, this should be prohibited by masking repetitive regions. But, there are two reasons for this to fail:
The real alignment location should have a proper alignment. Proper alignments should be excluded in these two scenarios:
Short TEs will induce local alignments which are most certainly not proper. These alignments should (1) be used to further extend the repeat mask and (2) be discarded in further analysis if there is a "large enough"¹ portion of the read that does not align to the reference – one might call this an "anti-anchor".
Thus, the first step in processing these alignments is to derive repetitive regions. Firstly, the region on the reference covered by the improper local alignment should be completely marked as repetitive. Secondly, should the involved portion of the read map to any other location on the reference then regions covered by that portion should be marked as repetitive as well.
Now, the second step is just the usual algorithm² used to remove repeat-induced reads from the set of candidates.
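The two repeat-marking rules of the first step can be sketched like this (intervals as (begin, end) pairs; the data layout is invented for illustration):

```python
def overlaps(a, b):
    """Half-open intervals (begin, end) overlap iff each begins
    before the other ends."""
    return a[0] < b[1] and b[0] < a[1]


def extend_repeat_mask(improper_alignments, alignments_by_read):
    """(1) Mark the reference interval of every improper local alignment
    as repetitive.  (2) If the involved read portion also maps elsewhere
    on the reference, mark those reference intervals as well."""
    mask = []
    for aln in improper_alignments:
        mask.append(aln["ref_interval"])
        for other in alignments_by_read.get(aln["read_id"], []):
            if other is aln:
                continue
            if overlaps(aln["read_interval"], other["read_interval"]):
                mask.append(other["ref_interval"])
    return mask


improper = [{"read_id": "r1",
             "ref_interval": (100, 200),
             "read_interval": (0, 100)}]
others = {"r1": [{"read_interval": (50, 150),
                  "ref_interval": (5000, 5100)}]}
print(extend_repeat_mask(improper, others))  # [(100, 200), (5000, 5100)]
```

In practice the resulting intervals would be merged into the existing Dazzler repeat mask rather than kept as a flat list.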
The filtering stage must be restructured as follows:
procedure filterReads:
begin
    assessRepeatStructure(selfAlignment)             // issue #25
    assessRepeatStructure(readsToReferenceAlignment) // see above
    filterWeaklyAnchoredReads()                      // issue #25
    filterAmbiguouslyMappingReads()                  // see above
    filterRedundantReads()                           // issue #3, #34
end
¹ see --min-anchor-length and issue #33.
When closing a gap several tasks need to be fulfilled:
The implementation of step (2) is a bit unclear: there are basically two options: (a) operate on the FASTA sequence or (b) operate on the binary data of the Dazzler tools.
The generated instructions in step (3) should be implemented as a struct that is able to accumulate many simple transforms – as produced in step (3). The resulting transform should be able to convert coordinates at run time and produce a human-readable string representation. Bonus: the format of the string representation can be configured to yield a certain syntax, e.g. a Python 2 lambda expression, a JavaScript/D inline function, a bash function, an R inline function, ….
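A sketch of such an accumulating transform, in Python instead of a D struct (the rule shape – "insertions shift all downstream coordinates" – is an assumption for illustration):

```python
class CoordinateTransform:
    """Accumulates simple shift rules of the form 'coordinates at or
    beyond `threshold` move by `delta`' (e.g. inserting a gap filler of
    length delta).  Rules are applied in the order they were added, so
    later thresholds refer to already-transformed coordinates."""

    def __init__(self):
        self.rules = []  # list of (threshold, delta)

    def add(self, threshold, delta):
        self.rules.append((threshold, delta))
        return self  # allow chaining

    def __call__(self, coord):
        for threshold, delta in self.rules:
            if coord >= threshold:
                coord += delta
        return coord

    def __str__(self):
        # human-readable form; other renderers (Python lambda,
        # JavaScript/D function, bash, R) could be added alongside
        body = "; ".join(f"if x >= {t}: x += {d}" for t, d in self.rules)
        return f"x -> ({body})"


transform = CoordinateTransform().add(1000, 50).add(5000, -20)
print(transform(500), transform(2000), transform(6000))  # 500 2050 6030
```

Note the order dependence: a rule added later sees coordinates already shifted by earlier rules, which matches accumulating transforms as gaps are processed one by one.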
When writing the output FASTA, each scaffold needs a header. How should this be handled, e.g. copy the original or generate from a template?
Each candidate read, i.e. not an "inner read", aligns to either the beginning or the end of a reference contig. When building pile ups, this has to be considered.
The "addressed end" of the reference contigs is ignored.
Pile ups contain only reads aligning to the same end(s) of the reference contigs.
From the simulator.c header comment:
[…] If a read pair is say b,e then
* if b < e the read was sampled from [b,e] in the forward direction, and from [e,b]
* in the reverse direction otherwise.
The current scripts for identifying the true pile ups do not account for that.
Extensions with only a few bases are likely an artefact of imperfect alignments. Thus, they should be excluded.
Selecting a "good" set of reads for the filling process is crucial to the success of the project as the most important goal is correctness.
(… --reads-error)

Given these values
L ... #{reference bps covered by alignment chain}
t ... trace point distance
n ... either L if global or t if local
z ... parameter for the confidence interval; multiplier for σ
ε_reads ... error rate [#{errors}/base pair] of reads
ε_ref ... error rate [#{errors}/base pair] of reference
ε = 1 - (1 - ε_reads)(1 - ε_ref)
X_ε ... #{errors in n bps with ε #{errors}/base pair}
we can approximate the distribution of X_ε
μ = nε
σ² = nε(1 - ε)
X_ε ~ B(n, ε)
     ~= N(μ, σ²) | valid if n >= 9/(ε(1 - ε))
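A quick numeric sketch of these definitions, assuming the combined per-base error rate ε = 1 - (1 - ε_reads)(1 - ε_ref), i.e. the probability that at least one of the two sequences differs at a base (the function name and example rates are made up):

```python
from math import sqrt

def error_interval(n, e_reads, e_ref, z=3.0):
    """Expected number of differences over n base pairs, plus a z-sigma
    interval from the normal approximation of B(n, eps).  The
    approximation is commonly taken to be valid when
    n >= 9 / (eps * (1 - eps))."""
    eps = 1 - (1 - e_reads) * (1 - e_ref)  # combined error rate
    mu = n * eps
    sigma = sqrt(n * eps * (1 - eps))
    return mu - z * sigma, mu, mu + z * sigma


low, mu, high = error_interval(n=10_000, e_reads=0.14, e_ref=0.01)
print(round(mu))  # 1486
```

An alignment chain whose observed difference count falls outside [low, high] would then be considered implausible at the chosen confidence level z.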
When reading the trace points, I check whether the accumulated number of differences in the trace point list equals the amount given by the -d option of LAdump. Sometimes this condition does not hold.
I have attached a small script and data that shows the problem. Just execute in bash.
Add a CLI option to specify a file to write the set of filtered reads, i.e. the presumably uniquely aligning reads, to. The output should be in the form of a Dazzler DB.
The input to this algorithm should be a set of scaffolds. But the Dazzler tools automatically split the input sequence when they encounter N's. Thus, scaffolds will be broken into contigs.

This algorithm should preserve the scaffold structure of the input data, i.e. if contig X is known to be just before contig Y, then they should be joined in that order or left as is.
If this algorithm has strong evidence for an erroneous scaffold the information should be emitted such that it can be used for further processing.
The OverlayDB needs to be merged producing the final output right before the algorithm is finished.