a-ludi / djunctor

Close assembly gaps using long reads with a focus on correctness.

License: MIT License
It is possible that a gap is reduced too much, i.e. the resulting contigs would overlap if correctly aligned. This has to be avoided at all costs because it makes downstream analysis much harder.

This is a follow-up to issue #25.
There is a known bug in damapper when it internally calls LAmerge: the latter uses an incorrect sort order, so broken alignment chains may result. Details are still missing.
Currently, if requested, a Dazzler mask describing the assessed repeat structure is written under a given name. Unfortunately, it is always placed in the working directory. There should be a way to designate a target location (relative to pwd).
From issue #35:
Above it says:
[…] By identifying those regions of the read that map to more than one locus we can derive some repetitive regions on the reference. […] Secondly, should the involved portion of the read map to any other location on the reference then regions covered by that portion should be marked as repetitive as well. […]
Realizing repeat detection by this technique is technically complex and might be replaced by a simpler approach: since all the involved alignments, or sub-regions thereof, map to the reference, these regions of the reference should also map to themselves. Thus, by adjusting the error tolerance (the -e switch of the Dazzler tools), we should be able to observe the same regions.
The OverlayDB should be "cleaned" every once in a while to free up disk space.
The repeat mask derived from the self-alignment seems to be a strict subset of the mask derived from the read alignment. Thus, we may skip the self-alignment, saving computation resources and reducing the complexity of the algorithm.
We can use reads that reach into gaps without spanning them to improve the sequence of reads that span the same gap, i.e. for a read R_a extending contig C_x and a read R_b extending contigs C_y and C_z (y != z), add R_a to the pile up of R_b iff x = y or x = z.
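This rule can be sketched in Python (djunctor itself is not Python; the data layout and names here are invented for illustration):

```python
def assign_extending_reads(extending_reads, gap_pileups):
    """For each read R_a that extends a contig C_x without spanning a
    gap, add it to the pile up of every gap flanked by C_x, i.e. every
    gap (y, z) with x == y or x == z.

    extending_reads: dict read id -> id of the contig the read extends
    gap_pileups:     dict (y, z) -> list of ids of reads spanning that gap
    """
    for read_id, x in extending_reads.items():
        for (y, z), pile_up in gap_pileups.items():
            if x == y or x == z:  # the read reaches into this gap
                pile_up.append(read_id)
    return gap_pileups


pile_ups = {(1, 2): ["span_1"]}
extenders = {"ext_a": 1, "ext_b": 3}  # ext_b touches no flank of gap (1, 2)
assign_extending_reads(extenders, pile_ups)
print(pile_ups)  # {(1, 2): ['span_1', 'ext_a']}
```

Such extension reads cannot close the gap on their own, but they add coverage to the pile up's consensus near the gap flanks.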
At the moment, similarScore
just has some arbitrary thresholds for absolute and relative 'error'. It would be nice to have a theoretical foundation for this function:
What will the consensus of two equally represented haplotypes look like?
Input:
tttccgccttacacaacacagtgttctgctattgcccagagggcactttt
tttgcatcaaactggaaacctagaaaaaatgggtgatttcctagtaattt
Consensus Result:
tttccg-ccttacacaacacagtgttctgctattgcccag-agggcactt
|||*||*|||||||||||||||||||*||||*|||||*||*|||||||||
tttgcgtccttacacaacacagtgtt-tgct-ttgcctagtagggcactt
tttgcatcaaactggaaacctagaaaaaatgggtgattt-cctagta----a-ttt
||||||||||*||*||*||**|*|*|*|*||**||*|||*||*||*|****|*|||
tttgcatcaa-cttgacac--a-acacagtgtttgctttgcccag-agggcacttt
Write unit tests for each module.
Issue by a-ludi
Thursday Mar 15, 2018 at 09:02 GMT
Originally opened as https://github.com/a-ludi/ddjunctor/issues/2
Maybe one can use machine learning to sort the reads into the three buckets?
Candidate Algorithms:
In order to compute the consensus of a pile up an alignment of the cropped reads must be computed. Ideally, we would want to use the information on the (relative) location of the cropped reads we gained during prior analysis and selection.
- lasfilteralignments to reduce faulty alignments.

This is an idea for a near-natural testing scenario with known "ground truth". Its main benefits are: (1) near-natural challenges and (2) knowledge about the correctness of the result.
It takes these steps to generate the test data:

1. Take two versions vX and vY, X < Y, of some genome.
2. Map vY to vX.
3. … vY and take vX to be the ground truth.

Assess the algorithm's result by comparing it to vY.

djunctor should be tested with batches of randomized input data in order to discover malfunctions that, by chance, do not occur in tests with static data.
TODO define details
In order to apply the algorithm multiple times, the user needs to provide intermediate results as input, one of which is the repeat mask. The user input shall be used as the initial value of the repeat mask, which then gets updated by the new alignments.
Until now, gaps can successfully be filled. Now it is time to transfer the procedure to the extension case:

- extend CoordinateTransform so extensions can be added
- adapt DJunctor.getInsertionInfo to the extension case
- add DJunctor.getExtensionSequenceSlice analogous to DJunctor.getGapSequenceSlice
- DJunctor.fillGap
- add DJunctor.extendContig analogous to DJunctor.fillGap
Currently, there is no explicit filter stage that discards improper alignments; instead, there are two stages: one ignoring them and the other implicitly removing them. These reads should be explicitly filtered out before engaging in the other stages.
In order to thoroughly test the function of djunctor, we need an automatic integration test suite. The suite should process some toy examples with known outcomes. The tests should cover different scenarios, from fully artificial and non-random to realistically simulated.
Previously, repeats were detected by analysing individual alignments – reference vs. reference and reference vs. reads (e.g. issues #35 and #36). This can lead to problems with tricky data, as individual alignments may be faulty or may result from unexpected data. Consider these two scenarios:
The solution to this problem is to make coverage-based decisions, ie. if for some locus the alignment coverage gets out of some boundaries that locus should be considered "bad". If the coverage is much higher than expected the locus likely is repetitive; if the coverage is much lower than expected either the reads or the reference have bad quality.
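The coverage-based decision can be sketched as follows (the thresholds and names are invented for illustration; they are not djunctor's actual parameters):

```python
def classify_loci(coverage, expected, low=0.25, high=2.0):
    """Label each locus by its alignment coverage relative to the
    expected coverage: far above -> likely repetitive, far below ->
    likely bad quality (reads or reference), otherwise unremarkable."""
    labels = []
    for locus_coverage in coverage:
        if locus_coverage > high * expected:
            labels.append("repeat")
        elif locus_coverage < low * expected:
            labels.append("low-quality")
        else:
            labels.append("ok")
    return labels


print(classify_loci([50, 10, 1], expected=10))  # ['repeat', 'ok', 'low-quality']
```

Unlike per-alignment heuristics, this aggregates evidence over all alignments covering a locus, so a single faulty alignment cannot flip the decision.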
For both the ref vs. ref and the ref vs. reads alignments do:
Since the loop shall be removed from the main algorithm, the result can be written sequentially to stdout at the end. Thus, OverlayDB only adds unnecessary complexity and shall be removed.

- stdout
- CoordinateTransform
- OverlayDB
Write unit tests for:
It is important to account for low-complexity regions as they produce ambiguous and unreliable alignments.

Candidate tool: DBdust
The two criteria in the original issue #3 address two different concerns. Thus, they should be separated into two functions:

- filterRedundantReads: the aligned read, i.e. the alignment extended with the exceeding read sequence on either end, is fully contained in a single contig.
- filterAmbiguouslyMappingReads: the read aligns to multiple locations of one contig with approximately equal quality.

In order to have a clean implementation, some generalized filtering pipeline should be devised.
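One possible shape for such a generalized pipeline, sketched in Python rather than the project's own language (the predicates and read fields are invented for illustration):

```python
def filter_pipeline(reads, *filters):
    """Apply each filter in turn; a filter is a predicate returning
    True for reads to keep.  Also record which filter discarded which
    read, which helps with debugging and reporting."""
    kept, discarded = list(reads), {}
    for keep in filters:
        remaining = []
        for read in kept:
            if keep(read):
                remaining.append(read)
            else:
                discarded[read["id"]] = keep.__name__
        kept = remaining
    return kept, discarded


def filter_redundant_reads(read):
    # discard reads fully contained in a single contig
    return not read["contained"]


def filter_ambiguously_mapping_reads(read):
    # discard reads with several equally good mapping locations
    return read["good_locations"] == 1


reads = [
    {"id": 1, "contained": False, "good_locations": 1},
    {"id": 2, "contained": True, "good_locations": 1},
    {"id": 3, "contained": False, "good_locations": 2},
]
kept, discarded = filter_pipeline(
    reads, filter_redundant_reads, filter_ambiguously_mapping_reads)
print([r["id"] for r in kept])  # [1]
```

Each concern stays in its own small predicate, and new filters (e.g. the weak-anchor filter from issue #25) can be appended without touching the pipeline itself.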
When acquiring the trace points for alignment chains, a so-called fingerprint of the alignment chain is built, which is used to identify it. Apparently, the fingerprint of the alignment chain in inverse order (ref2reads vs. reads2ref) sometimes differs slightly from the fingerprint in original order, e.g.:
{
"originalOrderFingerprint": [
6, // Contig A ID
1824, // Contig B ID
8115, // Begin on contig A of first local alignment in chain
0, // Begin on contig B of first local alignment in chain
12750, // End on contig A of last local alignment in chain
4645 // End on contig B of last local alignment in chain
],
"inverseOrderFingerprint": [
6, // Contig A ID
1824, // Contig B ID
8178, // Begin on contig A of first local alignment in chain
0, // Begin on contig B of first local alignment in chain
12750, // End on contig A of last local alignment in chain
4645 // End on contig B of last local alignment in chain
]
}
Is this a bug in LAdump or rather an invalid assumption in djunctor?

Before building the consensus, each read of a pile up should be cropped such that only the part reaching into the gap is considered. Read sequence outside of this region adds only noise to the pairwise alignment of the reads.
In order to have a "clean" cutting point, a trace point of the alignment common to all reads in the pile up should be used. The first/last such trace point should be used for a gap begin/end, thus maximizing the common alignment.
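A sketch of picking such common cutting points (assuming each read's alignment contributes contig positions at multiples of the trace spacing, so exact intersection is meaningful; the function name is illustrative):

```python
def common_cutting_points(trace_points_per_read):
    """Intersect the trace point positions (on the contig) of all reads
    in a pile up and return the first and last common one.  Cropping at
    the first/last common trace point keeps the alignment region shared
    by all reads as large as possible."""
    common = set(trace_points_per_read[0])
    for points in trace_points_per_read[1:]:
        common &= set(points)
    if not common:
        return None  # no clean cutting point exists
    return min(common), max(common)


print(common_cutting_points([
    [0, 100, 200, 300],
    [100, 200, 300, 400],
    [100, 300],
]))  # (100, 300)
```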
Applying this algorithm iteratively can be done from outside if given some intermediate results. Thus, the loop should be eliminated and access to the intermediate results provided.
// Initialization
maskBadRegions()
filterReadAlignments()

// Close Gaps
buildPileUps()
for each pileUp do
    selectGoodReads()
    cropAndValidateReads()
    buildConsensus()

// Finish
pruneOverfilledGaps()
generateOutput()
Until now, the workdir needs to be set at module scope in order for the module to function. This imposes limitations when it comes to switching the workdir.

Pass the workdir explicitly with every function call. Thus, it will be easy to use a different workdir for each call – if desired.
Currently, the CLI option --min-anchor-length controls how long an "anchor" must be at least to be considered unique. A sound value for this option should be derivable from the estimated reference and reads error rates, given some confidence level (see also issues #26, #28 and #32).
If you want to update an Illumina assembly (or similar), how much coverage should your new PacBio (or similar) reads have?
There should be an option to specify the expected error rate, which is then used to calculate the -e option for daligner or damapper.
When building pile ups only "valid" read alignments are considered. Sadly, a read spanning a whole contig and extending it is not "valid" as it has an ambiguous type (extension, spanning, …). At the same time, these reads are very valuable and must not be discarded.
insertionScore is going to include a penalty for improper alignments, and the split alignment will be improper.

Alternatively, the type of an alignment could be a combination of none, extension or span. This solution has some implementation overhead, as it requires adjustments in several places of the code, but avoids the drawbacks of the first solution.

Currently, an alignment/read is considered useless if:
Both criteria are far too harsh. Thus, I propose the following criteria:
It is important to account for repetitive regions as they produce ambiguous and unreliable alignments.
getComplementaryOrder was originally built so one can fetch trace points in the "complementary order". For that, only the beginning and end of the whole chain were needed. But now this function is being used to generate the working data and should be adjusted accordingly.
d'accord is a non-hybrid long-read consensus program based on local de Bruijn graph assembly. We use this tool to build a consensus out of a stack of alignments spanning/extending an assembly gap. The resulting sequence can then be used to close the gap at hand.
Using the -C option of damapper produces two LAS files, [B-reads].[A-reads].las and [A-reads].[B-reads].las. Currently, both files are used for different purposes, and a mapping between the alignment chains of both must be made (see djunctor.djunctor.AlignmentContainer). This seems to generate unnecessary complexity, thus opening doors for errors.

Applying this algorithm iteratively can be done from outside if given some intermediate results. Thus, the loop should be eliminated. The loop is currently limited to one iteration anyway.
// Initialization
maskBadRegions()
filterReadAlignments()

// Close Gaps
buildPileUps()
for each pileUp do
    selectGoodReads()
    cropAndValidateReads()
    buildConsensus()

// Finish
pruneOverfilledGaps()
generateOutput()
Reads will not be considered if they map to different locations on the reference with similar quality. This should take into account that sometimes only short regions of a read map to a different location (but with the same quality). This might be caused by short transposable elements (TE). Theoretically, this should be prohibited by masking repetitive regions. But, there are two reasons for this to fail:
The real alignment location should have a proper alignment. Proper alignments should be excluded in these two scenarios:
Short TEs will induce local alignments which are most certainly not proper. These alignments should (1) be used to further extend the repeat mask and (2) be discarded in further analysis if there is a "large enough"¹ portion of the read that does not align to the reference – one might call this an "anti-anchor".
Thus, the first step in processing these alignments is to derive repetitive regions. Firstly, the region on the reference covered by the improper local alignment should be completely marked as repetitive. Secondly, should the involved portion of the read map to any other location on the reference then regions covered by that portion should be marked as repetitive as well.
Now, the second step is just the usual algorithm² used to remove repeat-induced reads from the set of candidates.
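The two repeat-marking rules of the first step can be sketched like this (intervals as (begin, end) pairs; the data layout is invented for illustration):

```python
def overlaps(a, b):
    """Half-open intervals (begin, end) overlap iff each begins
    before the other ends."""
    return a[0] < b[1] and b[0] < a[1]


def extend_repeat_mask(improper_alignments, alignments_by_read):
    """(1) Mark the reference interval of every improper local alignment
    as repetitive.  (2) If the involved read portion also maps elsewhere
    on the reference, mark those reference intervals as well."""
    mask = []
    for aln in improper_alignments:
        mask.append(aln["ref_interval"])
        for other in alignments_by_read.get(aln["read_id"], []):
            if other is aln:
                continue
            if overlaps(aln["read_interval"], other["read_interval"]):
                mask.append(other["ref_interval"])
    return mask


improper = [{"read_id": "r1",
             "ref_interval": (100, 200),
             "read_interval": (0, 100)}]
others = {"r1": [{"read_interval": (50, 150),
                  "ref_interval": (5000, 5100)}]}
print(extend_repeat_mask(improper, others))  # [(100, 200), (5000, 5100)]
```

In practice the resulting intervals would be merged into the existing Dazzler repeat mask rather than kept as a flat list.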
The filtering stage must be restructured as follows:
procedure filterReads:
begin
    assessRepeatStructure(selfAlignment)             // issue #25
    assessRepeatStructure(readsToReferenceAlignment) // see above
    filterWeaklyAnchoredReads()                      // issue #25
    filterAmbiguouslyMappingReads()                  // see above
    filterRedundantReads()                           // issue #3, #34
end
¹ see --min-anchor-length and issue #33.
When closing a gap several tasks need to be fulfilled:
The implementation of step (2) is a bit unclear: there are basically two options: (a) operate on the FASTA sequence or (b) operate on the binary data of the Dazzler tools.
The generated instructions in step (3) should be implemented as a struct that is able to accumulate many simple transforms – as produced in step (3). The resulting transform should be able to convert coordinates at run time and produce a human-readable string representation. Bonus: the format of the string representation can be configured to yield a certain syntax, e.g. a Python 2 lambda expression, a JavaScript/D inline function, a bash function, an R inline function, ….
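A sketch of such an accumulating transform, in Python instead of a D struct (the rule shape – "insertions shift all downstream coordinates" – is an assumption for illustration):

```python
class CoordinateTransform:
    """Accumulates simple shift rules of the form 'coordinates at or
    beyond `threshold` move by `delta`' (e.g. inserting a gap filler of
    length delta).  Rules are applied in the order they were added, so
    later thresholds refer to already-transformed coordinates."""

    def __init__(self):
        self.rules = []  # list of (threshold, delta)

    def add(self, threshold, delta):
        self.rules.append((threshold, delta))
        return self  # allow chaining

    def __call__(self, coord):
        for threshold, delta in self.rules:
            if coord >= threshold:
                coord += delta
        return coord

    def __str__(self):
        # human-readable form; other renderers (Python lambda,
        # JavaScript/D function, bash, R) could be added alongside
        body = "; ".join(f"if x >= {t}: x += {d}" for t, d in self.rules)
        return f"x -> ({body})"


transform = CoordinateTransform().add(1000, 50).add(5000, -20)
print(transform(500), transform(2000), transform(6000))  # 500 2050 6030
```

Note the order dependence: a rule added later sees coordinates already shifted by earlier rules, which matches accumulating transforms as gaps are processed one by one.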
When writing the output FASTA, each scaffold needs a header. How should this be handled, e.g. copy the original or generate from a template?
Each candidate read, i.e. not an "inner read", aligns to either the beginning or the end of a reference contig. When building pile ups, this has to be considered.
The "addressed end" of the reference contigs is ignored.
Pile ups contain only reads aligning to the same end(s) of the reference contigs.
From the simulator.c header comment:
[…] If a read pair is say b,e then
* if b < e the read was sampled from [b,e] in the forward direction, and from [e,b]
* in the reverse direction otherwise.
The current scripts for identifying the true pile ups do not account for that.
Extensions with only a few bases are likely an artefact of imperfect alignments. Thus, they should be excluded.
Selecting a "good" set of reads for the filling process is crucial to the success of the project as the most important goal is correctness.
(… --reads-error)

Given these values
L ... #{reference bps covered by alignment chain}
t ... trace point distance
n ... either L if global or t if local
z ... parameter for the confidence interval; multiplier for σ
ε_reads ... error rate [#{errors}/base pair] of reads
ε_ref ... error rate [#{errors}/base pair] of reference
ε = 1 - (1 - ε_reads)(1 - ε_ref)
X_ε ... #{errors in n bps with ε #{errors}/base pair}
we can approximate the distribution of X_ε
μ = nε
σ² = nε(1 - ε)
X_ε ~ B(n, ε)
     ~= N(μ, σ²) | valid if n >= 9/(ε(1 - ε))
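A quick numeric sketch of these definitions, assuming the combined per-base error rate ε = 1 - (1 - ε_reads)(1 - ε_ref), i.e. the probability that at least one of the two sequences differs at a base (the function name and example rates are made up):

```python
from math import sqrt

def error_interval(n, e_reads, e_ref, z=3.0):
    """Expected number of differences over n base pairs, plus a z-sigma
    interval from the normal approximation of B(n, eps).  The
    approximation is commonly taken to be valid when
    n >= 9 / (eps * (1 - eps))."""
    eps = 1 - (1 - e_reads) * (1 - e_ref)  # combined error rate
    mu = n * eps
    sigma = sqrt(n * eps * (1 - eps))
    return mu - z * sigma, mu, mu + z * sigma


low, mu, high = error_interval(n=10_000, e_reads=0.14, e_ref=0.01)
print(round(mu))  # 1486
```

An alignment chain whose observed difference count falls outside [low, high] would then be considered implausible at the chosen confidence level z.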
When reading the trace points, I check whether the accumulated number of differences in the trace point list equals the amount given by the -d option of LAdump. Sometimes this condition does not hold.
I have attached a small script and data that shows the problem. Just execute in bash.
Add a CLI option to specify a file to write the set of filtered reads, i.e. the presumably uniquely aligning reads, to. The output should be in the form of a Dazzler DB.
The input to this algorithm should be a set of scaffolds. But the Dazzler tools automatically split the input sequence when they encounter N's. Thus, scaffolds will be broken into contigs.

This algorithm should preserve the scaffold structure of the input data, i.e. if contig X is known to be just before contig Y, then they should be joined in that order or left as is.
If this algorithm has strong evidence for an erroneous scaffold the information should be emitted such that it can be used for further processing.
The OverlayDB needs to be merged producing the final output right before the algorithm is finished.