kevlar-dev / kevlar Goto Github PK

View Code? Open in Web Editor NEW

40.0 5.0 9.0 73.19 MB

Reference-free variant discovery in large eukaryotic genomes

Home Page: https://kevlar.readthedocs.io

License: MIT License

Python 94.52% Makefile 0.50% C++ 3.18% Shell 0.87% C 0.64% Dockerfile 0.29%

genomics genome-variation dna variant-calling

kevlar's Issues

Print recalculated abundances in "kevlar filter"

Currently, interesting k-mer abundances are recomputed by "kevlar filter", but these are not updated in the output. This should be addressed.

Add option to flush file handle in `kevlar find`

On Davis HPC, we've previously had (what we suspect are) issues with the system buffering output from Python scripts. I thought the problem would be restricted to stdout, but it seems to be happening even with handles to actual files on disk.

Flushing the file handle after each write takes more time, so we don't want it to be the default behavior. But we can definitely add it as an option when memory consumption is exorbitant.

Investigate second abundance re-calculation in kevlar filter

The kevlar filter command already recomputes k-mer abundances and discards k-mers whose correct abundance doesn't satisfy the threshold. Any read with no remaining "interesting" k-mers is discarded.

I was surprised today to find that running the same filter a second time will discard even more reads (no changes for third and subsequent attempts). At first I thought this might be a kind of sweep effect, in which some k-mers only reached the threshold in the first place because they were in reads that had a k-mer with a highly inflated abundance. But that doesn't make sense--if an interesting k-mer is discarded but the read has another, the read should not be discarded.

I need to investigate this further, and make sure the filter is doing the right thing. If it is, kevlar filter should probably do both rounds of filtering with a single invocation of the command.

Autoindex for kevlar localize

Currently kevlar localize relies on the reference genome having already been processed by bwa index. It shouldn't be too hard to autodetect whether the index files exist, and if they do not, to invoke bwa index programmatically from Python.

Interesting k-mer offsets are incorrect in "kevlar assemble" output

Here is an example.

>contig4;cc=1
AAGGTGAAAGCAGAAAGTCAGTTGTCCATATGGCTTGGGGAGATAAAGAAGGCCCAGGAAGGCCTCCAGGAAAAGGCTGCCATGTCAGGCAGGACACAGAGGGCAATTGAGGAAAAGTGATTCTTACAAGATGGTGAAGGTGCCATTGTGGGTGTT
                                                                                CAGGCAGGACACAGAGGGCAATTGAGGAAAA          18 0 0#
                                                                                  GGCAGGACACAGAGGGCAATTGAGGAAAAGT          18 0 0#
                                                                                         CACAGAGGGCAATTGAGGAAAAGTGATTCTT          18 0 0#
                                                                                          ACAGAGGGCAATTGAGGAAAAGTGATTCTTA          17 0 0#
                                                                                           CAGAGGGCAATTGAGGAAAAGTGATTCTTAC          17 0 0#
                                                                                            AGAGGGCAATTGAGGAAAAGTGATTCTTACA          17 0 0#
                                                                                              AGGGCAATTGAGGAAAAGTGATTCTTACAAG          16 0 0#
                                                                                                 GCAATTGAGGAAAAGTGATTCTTACAAGATG          17 0 0#

Track read IDs with a bloom filter in kevlar collect

Suggestion per @ctb: see #36 (comment).

Investigate making partitioning deterministic

The kevlar partition code constructs a graph of shared interesting k-mers, the connected components of which are distinct "partitions" which are handled separately in subsequent processing. The data structure used to sort the graph and the function used to return the connected components doesn't return CCs in a deterministic order, so running the partition code multiple times on the same input gives different output (well, presumably the same output, but in different orders).

Although it's not scientifically or even technically important, for the sake of evaluation and rapid iterative development it might help to have the partitions reported in a deterministic order.

Test impact of memory size on performance

I've always been thinking in terms of "more memory is always better because lower FPR". This is true, but there's evidence, at least for some computations, that runtime scales with memory consumption. So there's an optimization to be made here: what is the smallest amount of memory I can request that won't lead to big problems with FPR?

If this observation holds and if @ctb and I aren't completely of our rockers, we should actually get a speedup when we parallelize (as in #15). If we run all the parallel processes at once, then the total memory consumption will still be the same, but we would have the option to split it up.

Fix problems with tests

Coverage not measured correctly with CI build (probably related to pip install . vs PYTHONPATH)
Move tests into package as a subpackage, make runnable with a pip install

Consider the following: big differences in sequencing depth

Another comment from @drtamermansour: if the samples have different sequencing depths, you might expect to see some k-mers in high abundance in some samples and absent in other samples simply due to chance. Are there circumstances where we need to normalize by depth of sequencing?

Also, if library complexity is lower in one sample than another, the effective depth will be higher even if the same number of reads are generated. Tamer has seen this in RNA-seq samples, is this a problem with WGS libraries? If so, how do you measure it?

Write k-mers to banded counttables in a single pass

Suggestion from @drtamermansour: in a single pass, write count tables to N files (one for each band) in a single pass. Then running kevlar find in N bands would not require N passes over the entire data set, just loading the count tables from disk N times.

I just wanted to capture this suggestion, I have some concerns and I'm not sure it would yield much benefit.

Populating count tables from Fastq reads is only a small portion of the overall runtime of kevlar find. Loading from count table files rather than the Fastq files again probably won't make a huge difference in overall runtime.
We have^[1] to do a second pass over the case reads anyway, and this pass is the big time consumer. Any optimizations we wanted to do in the future would probably benefit more from focusing on this rather than the banded loading.

And in any case, this is all optimization: there's still work to do to get reliable results first!

^[1] There are ways we could investigate to do this in a streaming fashion, but for now I'm happy with saying we have to do a second pass over the reads. :-)

Picking values of k

In addition to picking appropriate memory sizes, there's also the question of picking an appropriate value of k. For the human stuff we're currently limited by khmer's maximum of 32, but that should be solved very soon.

Add input file type auto-detect for count[graph|table]

I'm wasting a lot of development time loading and re-loading Fastq data into counting data structures. Having the convenience of supporting Fasta/Fastq files as input is a priority, but I desperately need to take advantage of pre-computed countgraphs/counttables when they are available.

I propose the following behavior, which I think is 1) fairly reasonable from a user's perspective and 2) fairly straightforward to implement.

For data files to be loaded into a counting data structure, check the suffix.
- If the suffix is .cg or .countgraph, load the file with khmer.load_countgraph().
- If the suffix is .ct or .counttable, load the file with khmer.load_counttable().
- Otherwise, assume the file is Fast[a|q]; create a counttable ct and load with ct.consume_seqfile().

Add alignment score options to "kevlar call" CLI

In case one needs to adjust match / mismatch scores or gap open / extend penalties.

Write docs

in docs/ directory of master branch (see this)
mkdocs – cinder theme

Investigate sorting edges by combined degree of adjacent edges

In the current assembly code, edges in the "shared interesting k-mers" graph are processed in order of overlap: the largest overlaps first. I should investigate processing instead by node degree: high-degree nodes are less likely to lead to a premature truncation of the contig due to a sequencing error.

Migrate to dib-lab/kevlar

cc @ctb

lab branding
organizations >> individuals for sake of longevity

Will migrate once open PR(s) are merged.

Additional output for `find`

The find-novel-kmers.py script (soon to be kevlar find) currently supports three types of output.

augmented Fastq (read plus "interesting" k-mers in context; can get plain Fastq by grepping out lines ending with #)
"interesting" k-mers in Fasta format
associated graph linear paths in Fasta format

I propose that enable creation of 3 additional output files, all tabular.

one with a single k-mer in the first column, followed by a column of all associated read IDs (comma-separated), followed by a column of all associated linear path IDs (comma-separated)
one with a single read ID in the in the first column, followed by k-mers and then linear paths
one with a single linear path ID, followed by reads and then k-mers

All of the information is already there, but this will make it much more searchable with standard shell commands.

Abundance filtering

We're seeing some strange contamination in the 12175 SSC trio. This morning we discussed a read filtering strategy. While checking each read for "interesting" k-mers, if any k-mer has an abundance below some threshold \tau then the entire read should be discarded. This should be fairly easy to implement as an option, and hopefully shouldn't slow down performance too much. If anything, it might improve performance by discarding reads before all k-mers have been checked.

Fill out docstrings

There is a lot of room for improving code documentation with docstrings.

Profiling results

At the moment kevlar is painfully slow. We've been philosophizing for a while now whether this was more likely due to the poor cache locality of khmer's Count-Min Sketch implementation or to slow Fastq parsing/handling, or some other factor or combination of factors. It's time to evaluate this empirically.

I profiled both kevlar count and kevlar novel on the new simulated data set I created in #83. It's small enough for kevlar to process in several minutes, but large enough to observe meaningful numbers in the profile. For kevlar novel, I computed k-mer counts from scratch, not using precomputed counts.

Here are the top 25 functions, sorted by time spent, for each.

# kevlar count
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)                                                                         
 51921279   61.111    0.000   61.111    0.000 {method 'get' of '_khmer.KHashtable   ' objects}                                                  
 36053659   46.578    0.000   46.578    0.000 {method 'add' of '_khmer.KHashtable   ' objects}                                                  
        3   38.588   12.863  187.651   62.550 counting.py:22(load_sample_seqfile)                                                               
   887492   29.514    0.000   29.514    0.000 {method 'get_kmers' of '_khmer.KHashtable   ' objects}                                            
   887495    5.401    0.000    5.401    0.000 __init__.py:97(multi_file_iter_khmer)                                                                887492    2.150    0.000    2.150    0.000 {method 'split' of '_sre.SRE_Pattern' objects}                                     
  1774984    1.703    0.000    6.031    0.000 __init__.py:103(clean_subseqs)                                                                    
   887819    1.123    0.000    1.157    0.000 re.py:278(_compile)                                                                
   887492    0.905    0.000    4.177    0.000 re.py:195(split)                                                                                  
        3    0.323    0.108    0.323    0.108 {method 'save' of '_khmer.KHashtable   ' objects}               
    82/54    0.154    0.002    0.209    0.004 {built-in method _imp.create_dynamic}           
896021/895669    0.153    0.000    0.153    0.000 {built-in method builtins.len}
      257    0.101    0.000    0.101    0.000 {built-in method __new__ of type object at 0x10ffa3080}
      522    0.050    0.000    0.050    0.000 {built-in method marshal.loads}
        3    0.027    0.009    0.027    0.009 {method 'read' of '_io.BufferedReader' objects}
    803/1    0.021    0.000  188.297  188.297 {built-in method builtins.exec}
     2761    0.019    0.000    0.019    0.000 {built-in method posix.stat}
  613/609    0.016    0.000    0.028    0.000 {built-in method builtins.__build_class__}
      243    0.015    0.000    0.015    0.000 {built-in method builtins.compile}
     1190    0.015    0.000    0.064    0.000 <frozen importlib._bootstrap_external>:1215(find_spec)
      522    0.012    0.000    0.018    0.000 <frozen importlib._bootstrap_external>:816(get_data)
        2    0.008    0.004    0.021    0.010 machar.py:116(_do_init)
     6226    0.007    0.000    0.019    0.000 <frozen importlib._bootstrap_external>:50(_path_join)
     6226    0.007    0.000    0.010    0.000 <frozen importlib._bootstrap_external>:52(<listcomp>)
      704    0.007    0.000    0.087    0.000 <frozen importlib._bootstrap>:879(_find_spec)

# kevlar novel
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 67446100   80.216    0.000   80.216    0.000 {method 'add' of '_khmer.KHashtable   ' objects}
  1183526   36.393    0.000   36.393    0.000 {method 'get_kmers' of '_khmer.KHashtable   ' objects}
 29633082   34.668    0.000   34.668    0.000 {method 'get' of '_khmer.KHashtable   ' objects}
        3   25.096    8.365  142.916   47.639 ./kevlar/counting.py:22(load_sample_seqfile)
 22496672   18.973    0.000   54.691    0.000 ./kevlar/novel.py:23(kmer_is_interesting)
        1   14.966   14.966  234.252  234.252 ./kevlar/novel.py:42(main)
   887495    5.090    0.000    5.090    0.000 ./kevlar/__init__.py:97(multi_file_iter_khmer)
   296035    3.894    0.000   10.359    0.000 /Users/standage/Projects/kevlar/env/lib/python3.5/site-packages/screed/fastq.py:14(fastq_iter)
   887492    1.877    0.000    1.877    0.000 {method 'split' of '_sre.SRE_Pattern' objects}
  1774984    1.387    0.000    4.978    0.000 ./kevlar/__init__.py:103(clean_subseqs)
  1184137    1.336    0.000    1.980    0.000 {method 'readline' of '_io.BufferedReader' objects}
  7784325    1.175    0.000    1.175    0.000 {method 'append' of 'list' objects}
  1183855    1.153    0.000    1.187    0.000 /Users/standage/Projects/kevlar/env/lib/python3.5/re.py:278(_compile)
  1184137    0.998    0.000    3.905    0.000 /usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/gzip.py:370(readline)
   887492    0.753    0.000    3.462    0.000 /Users/standage/Projects/kevlar/env/lib/python3.5/re.py:195(split)
  1184138    0.635    0.000    0.927    0.000 /usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/_compression.py:12(_check_not_closed)
   296037    0.544    0.000    0.544    0.000 {method 'search' of '_sre.SRE_Pattern' objects}
  1184137    0.544    0.000    0.968    0.000 /Users/standage/Projects/kevlar/env/lib/python3.5/site-packages/screed/utils.py:4(to_str)
   296034    0.506    0.000    0.625    0.000 /Users/standage/Projects/kevlar/env/lib/python3.5/site-packages/screed/screedRecord.py:23(__init__)
  1184179    0.424    0.000    0.424    0.000 {method 'decode' of 'bytes' objects}
  1186054    0.366    0.000    0.366    0.000 {method 'startswith' of 'str' objects}
2429318/2428966    0.359    0.000    0.359    0.000 {built-in method builtins.len}
     8852    0.317    0.000    0.317    0.000 {method 'decompress' of 'zlib.Decompress' objects}
  1184140    0.292    0.000    0.292    0.000 /usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/gzip.py:296(closed)
   296037    0.289    0.000    1.154    0.000 /Users/standage/Projects/kevlar/env/lib/python3.5/re.py:170(search)

In both cases, the add and get functions for incrementing and querying k-mer counts from the Count-Min sketch dominate the runtime, with the get_kmers and re.split functions as heavy hitters as well. The latter two have to do with the fact that the khmer's bulk loading functions don't support the kind of preprocessing we need for kevlar, so I'm doing it in Python. This incurs overhead in sending the data from C++ to Python objects, and then doing the processing in Python.

As far as priorities, I'm not sure there's much we can do about the Count-Min sketch implementation. We could try to implement buffering (collect N≈1e4 add operations before actually incrementing the tables) but honestly once the CQF-based counter is integrated we may already get much better performance from that. For the sequence loading, I'll continue to lobby for better support for multiple pre-processing strategies with khmer bulk Fastq loading code.

Contig assembly

A while ago I replaced linear path assembly with junction count assembly in the kevlar collect step. This eliminated problems we previously had with cyclical structure in the assembly graph, but it doesn't seem to have done much about truncation at sequencing errors. I've been playing around with khmer's simple labeled assembler, but this will take some effort to get working.

@ctb suggested investigating abundance-based trimming / spur removal. Probably a medium-term priority.

In the mean time, I assembled the reads described in dib-lab/khmer#1657 with velvet (using vanilla parameters) and got almost the entire consensus sequence of the reads' multiple sequence alignment. If we can get partitioning working well, one preliminary approach could be to simply delegate assembly of each distinct CC to a third-party assembler.

Parallelize based on sequence content

Our current workflow and scripts support the idea of splitting data by chromosome, which is intended to:

make memory consumption more manageable
support parallel processing
make results smaller and easier to evaluate and interpret

The problem is that we want this to be a reference-free method, but we're relying increasingly on the reference and read mappings against the reference. The current criteria in kevlar dump are:

If the read alignment is a secondary or supplementary alignment, discard
If the read matches the reference genome perfectly, discard
If the read aligns to the chromosome of interest, do not discard
If at least half of the k-mers in the read are not present in the genome mask (the reference genome minus the chromosome of interest), do not discard

The first two points are fine, and when mapping data happens to be available we can leverage it. But we keep on running into problems with the second two points.

Today we discussed the following alternative approach.

In an initial run, consider only k-mers beginning with the nucleotide 'A'. As before, find those that are abundant in the proband and "absent" (for some value of "absent") in the parents. Record these "interesting" k-mers.
Repeat the run, but the second time consider only k-mers beginning with 'C'. The third time 'G', the fourth time 'T'.
Once we have a combined list of interesting k-mers, we can go over all of the proband reads, keep only the reads containing interesting k-mers, and use this to group k-mers by event (using the corresponding linear paths or assemblies).

A couple of considerations.

If reads are in BAM format, we could still take advantage of the reference genome to discard trivially uninteresting reads early on in the workflow.
When collecting reads with interesting k-mers, we may want to do a second pass to collect reads that share k-mers with reads containing interesting k-mers.
Rather than using a single nucleotide as a prefix, we could use a prefix of arbitrary length to split up the work as much as we'd like. One consideration is that some prefixes will be much more abundance than others.

Add (temporary) convenience functions until they can be integrated into khmer

load_nodetable
create_counttable
create_nodetable

The latter two can eventually be refactored into create_countgraph and create_nodegraph.

Implement local --> global coordinate conversion

Currently, assembled contigs are aligned against a small cutout of the reference genome for variant calling. We need to be able to convert the coordinates relative to the reference back to the global coordinates relative to the entire chromosome.

Replace explicit StringIO captures with pytest capsys

See tests/test_rcgrep.py for an example.

Discrepancy between final report and output file

The kevlar find command prints a summary of the results at the end of the program execution, like so.

    processed 629000000 reads...
    processed 630000000 reads...
    processed 631000000 reads...
[kevlar::find] Found 2374 novel kmers in 40279 reads, 2366 linear paths

However, there's a discrepancy between the number of reads reported and the number of reads found in the output file. The output file is the reads in Fastq format, adorned with additional lines containing the novel k-mers (and their abundances) with a trailing # character.

@ERR894723.92068452
GAATAGAATGGAATGGAAAGAATTGGAATGGAATGGAATCGAATGGAATGGAATGGAATGGAATGGAATGGAATCAACCCGAGTGCAGGGGAATGTAATGGATCGGAATGCAGTGGAATGGAATCA
+
70<0070B0<<0<00<<BB<0<0BB<<B0<B<70B<BBB7<0B<B<<B0B<<BBBBBBBBBB<<BBBBBBBBBBBBB<B7<B<BBBBBBBBBBB<BBBBBBBB7BBBBBBBB<BBBBBB<<B<<<<
                                                                                     CAGGGGAATGTAATGGATCGGAATGCAGTGG          26 0 0#
                                                                                      AGGGGAATGTAATGGATCGGAATGCAGTGGA          25 0 0#
                                                                                       GGGGAATGTAATGGATCGGAATGCAGTGGAA          26 0 0#
                                                                                        GGGAATGTAATGGATCGGAATGCAGTGGAAT          27 0 0#
                                                                                              GTAATGGATCGGAATGCAGTGGAATGGAATC          27 0 0#
                                                                                               TAATGGATCGGAATGCAGTGGAATGGAATCA          26 0 0#
@ERR899709.54045623
TGGAATGGAAAGAATTGGAATGGAATGGAATCGAATGGAATGGAATGGAATGGAATGGAATGGAATCAACCCGAGTGCAGGGGAATGTAATGGATCGGAATGCAGTGGAATGGAATCATCCGGAAT
+
<B<BBB<BBBBBBB<B<BBBBBBBBBBBBBB<7BB<BBBBBBB7B<<BB<<BBBB<BBBB0<<<BBB<B<B<7BB<B<BBBB<<B0B0<0BBBB0<7BBBBBBB0BBBB0BBB<B00000000<<<
                                                                             CAGGGGAATGTAATGGATCGGAATGCAGTGG          26 0 0#
                                                                              AGGGGAATGTAATGGATCGGAATGCAGTGGA          25 0 0#

Removing the extra metadata to get strict Fastq format is simple: grep -v '#$' file.txt > file.fq. We can then determine the number of reads in the Fastq file by computing the line count and dividing by 4.

$ wc -l NA19240.novel.fq 
161688 NA19240.novel.fq
$ python -c 'print(161688 / 4)'
40422

kevlar find: do not consume k-mers that are in the reference genome

The kevlar dump command implements a data reduction step that discards any reads that match the reference genome perfectly. After this step, however, kevlar makes no attempt to determine whether a k-mer is in the reference genome. As a result, we are seeing some false positives reported in cases where certain k-mers are "not present" in the parents (because all overlapping reads match the genome perfectly and are thus discarded before counting) but "high abundance" in the proband (a sufficient number of reads overlapping the k-mer have errors elsewhere in the read and are thus NOT discarded).

For novel variant discovery, we're not really interested in k-mers that are in the reference genome, and we need to update our counting strategy to reflect this. There are potential implications for memory consumption to consider and investigate.

Calculate FPR

When case and controls are loaded into counttables for kevlar find, the script should calculate FPR. This code already exists in khmer: in the short term, it's probably quickest and easiest to duplicate the code, but in the future it is probably worth refactoring the khmer code to make it reusable.

Give linear path assembly a chance!

I had all but condemned kevlar collect, which uses khmer's linear path assembly foo, to deprecation and phasing out. But it turns out that it seems to work just fine on kevlar-partitioned and abundance-trimmed data. I need to revisit this and perhaps integrated it with the existing kevlar assemble command.

Enable "strict" mode when partitioning reads

Currently, the kevlar filter command has the option to output reads partitioned by shared novel k-mers. The kevlar assemble command is stricter: it builds its graph not only with shared novel k-mers but also enforces that the entire overlaps must be identical. Two proposals.

a bit of refactoring for an independent kevlar partition command
enable a "strict" mode for partitioning that requires not only the shared k-mer but also identity in the overlapping portions of the associated reads

Multiple paths per variant

Currently kevlar find reports:

"interesting" k-mers
reads containing "interesting" k-mers
linear paths from the assembly graph baited by the "interesting" k-mers

In an ideal world, each variant would be associated with a single contig/path that is long enough to determine its position in the genome uniquely. Practically, that's not happening.

Consider the following example, newly introduced in #16. The case has a 5bp deletion relative to the controls. The kevlar find script found 11 "interesting" k-mers (k=13) in 16 reads, and these 11 k-mers pulled out 4 unique linear paths from the assembly graph.

        GGGTCCCTATTGACCTCTTTACCA
        GGGTCCCTATTGACC
              CTATTGACCTCTTTACCA
TGGGCTCGGGGTCCCTATTGACC

A couple of things to note.

The 2nd and 3rd contigs are subsequences of the 1st. These are easy to collapse, which is also introduced in #16 (so that the final output of kevlar find shows only two linear paths).
The 1st and 4th contigs overlap, but for some reason they do not exist as a single linear path in the assembly graph.

This second bullet point is what perplexes me now. All along when I've observed this in the human data, I assumed it was the result of sequencing errors introducing structure in the assembly graph and truncating linear paths. But for this example, I simulated reads with no errors or mutations. Why do either of these cases occur?

Make sample loading with dilution an option in kevlar count

Currently, kevlar novel loads each sample into a dedicated Bloom filter, each of size M, while kevlar count loads the first sample into a BF of size M and all subsequent samples into BFs of size M/x for some factor x. This latter behavior should be an option in kevlar count.

Investigate refactoring some of kevlar's main methods

Nothing major, but most of the commands output a list of read or contig sequences. The idea would be to convert the existing "main" methods into a generator that yields records instead of printing them, and then use a minimal wrapper around that as the "true" main method to actually print the output.

This would make it trivial to programmatically string together subcommands.

Add support for Fastq input in kevlar find

Currently, kevlar find only supports pre-computed count graphs. It would be nice if there was also an option to compute counts on the fly from Fastq input. This will clutter up the CLI a bit, especially if there's a reason to support different table sizes for different input files (although I don't think there is).

Support arbitrary numbers of cases

The kevlar find procedure already has support for an arbitrary number of controls. It should be trivial to introduce support for an arbitrary number of cases. For situations where we have many cases and controls and memory is a limiting factor, we'll have to use k-mer banding and multiple passes over the data.

Low complexity novel k-mers and contig assembly

The LinearPathAssembler kevlar has been using thus far to retrieve contigs from the assembly graph doesn’t handle non-branching cycles in the graph very well. As a stopgap solution, I added an option to kevlar collect to specify a list of k-mers to ignore when retrieving contigs from the graph. To find out which k-mers were problematic, I had to run kevlar collect a few dozen times with debugging print statements to find out which k-mers were causing it to choke. Luckily kevlar collect is pretty quick and this didn’t take a huge amount of time, even though it was tedious. For my latest run (results from 32 kevlar find runs with k-mer banding), there were 12 problematic k-mers. Adding these to the kevlar collect --ignore flag leads to a successful run.

Since all of these problematic k-mers were AT-rich and low-complexity, my first thought for a more robust solution was to filter out low-complexity novel k-mers. I poked around with this a bit, but I’m concerned that it will lack specificity, and it turned out complexity scores are not as trivial to compute as I initially thought.

And then of course I thought of the fact that we’ve been talking about using a less naive assembly approach for a long time and tried substituting the LinearPathAssembler with the JunctionCountAssembler. This was a straightforward substitution, and handled all of the previously problematic k-mers with finesse. I still need to evaluate the contigs assembled by the two different assemblers, but this seems like the best long-term solution, especially if it gives us longer contigs.

Add pairing information to output of dump command

The kevlar commands don't use pairing information, so I haven't really worried to much about it before. But some parts of kevlar do rely on reads having unique IDs, and kevlar dump currently prints the same ID for reads belonging to the same pair. This is a straightforward fix and should be done.

Refactor CLI

Related to #52, making kevlar an entry point rather than a script and reorganizing the argument parsers a bit will not only make them easier to test (increasing coverage), but will also make it easier to replace StringIO with capsys for capturing test output.

PCR duplicates

Since pre-processing input BAM files prior to analysis isn't an option, we should be able to handle this on a partition-by-partition basis. We don't have access to both ends of a read, so duplicate sequences MIGHT not be PCR duplicates, but we can deduplicate anyway (make it an option to disable deduplication).

Drop pysam dependency

It's heavy and kludgy for CI (which means it's probably a pain on a lot of HPC as well). Need Python-side support for BAM parsing, but something in Pure python would be preferred.

Drop Cython as a install dependency, add to `make devenv` target

The cythonized C file is distributed with the code base, so the user shouldn't need to have Cython installed to compile and install and run kevlar.

Reported k-mer abundance doesn't match actual abundance when banding

When using the banding approach, there is a discrepancy between the reported abundance and actual abundance in some of the k-mers reported as novel by kevlar find. This is particularly puzzling because the overall false positive rate is fairly low (0.14), but the k-mers in question have a reported abundance of >10 but an actual abundance of 1 (in all of the cases I’ve had time to fully investigate). Even more puzzling is that when I load the data into memory later (with an interactive python session) I cannot reproduce the error.

Poor performance when collapsing contigs

Currently the variantset.collapse method takes a very naive approach to collapsing contigs that are subsequences of other contigs. The contigs are first sorted by length, and then added to a new set of unique contigs only if they are not a subsequence of any contig already in the unique set.

unique_contigs = set()
for contig in sorted(contigs, key=len, reverse=True):
    contigrc = kevlar.revcom(contig)
    keep =  True
    for uc in unique_configs:
        if contig in uc or contigrc in uc:
            keep = False
            break
    if keep:
        unique_contigs.add(contig)

I’ve tried both a set (fast membership queries, slow iteration) and list (fast iteration, slow membership queries) as the container variable unique_contigs, but the naive O(N^2) algorithm washes out any small benefit of changing the data structure.

I explored an alternative approach using a suffix tree and got a huge speedup (≈15min to ≈20 seconds).

unique_contigs = set()
suffix_tree = trie()
for contig in sorted(contigs, key=len, reverse=True):
    contigrc = kevlar.revcom(contig)
    query = suffix_tree.with_prefix(contig) + \
                  suffix_tree.with_prefix(contigrc)
    if query == []:
        unique_contigs.add(contig)
        for i in range(len(contig)):
            suffix_tree[contig[i:]] = 1

Two concerns with this approach.

This introduces BioPython as a dependency, which is not lightweight (it’s a huge library with some C extensions, including the backend of the trie implementation). I couldn’t find any Pure python suffix tree implementations: probably a lost cause with respect to memory overhead.
The number of collapsed contigs isn’t identical to the previous approach, which will require time to investigate potential issues with both approaches. A bit more evaluation is probably warranted anyway to be more confident it’s doing what we want, but this will take a bit of time.

Separate filtering/deduplication from assembly

From @ctb:

In update/filter_below_abund branch of khmer, I’ve fixed a few sandbox scripts and implemented dedup.py. I then ran the attached commands.

Briefly, they —

de-duplicate your reads based on name
load them into a counting table
slice out reads between coverage 2 and 120 (into .selected)
trim off low-abundance k-mers
partition
extract partitions
build a cDBG for group 12 of the partitions, which contains 6 partitions.

python dedup.py all-fq > all-fq.dedup
../scripts/abundance-dist.py all-fq.dedup all-fq.dedup.hist -k 31
../scripts/abundance-dist-single.py -k 31 -M 1e8 all-fq.dedup all-fq.dedup.hist
../scripts/load-into-counting.py -k 31 -M 1e8 dedup.kh all-fq.dedup 
../sandbox/calc-median-distribution.py dedup.kh all-fq.dedup all-fq.dedup.medhist
../sandbox/slice-reads-by-coverage.py -m 2 -M 120 dedup.kh all-fq.dedup all-fq.dedup.selected
../scripts/trim-low-abund.py -k 21 -M 1e8 all-fq.dedup.selected
../scripts/abundance-dist-single.py -k 21 all-fq.dedup.selected.abundtrim all-fq.dedup.selected.hist -M 1e8 -s
../scripts/do-partition.py -k 31 -M 1e9 y2 all-fq.dedup.selected.abundtrim 
../scripts/extract-partitions.py -X 5000 foo all-fq.dedup.selected.abundtrim.part
../sandbox/extract-compact-dbg.py -x 1e9 -k 31 foo.group0012.fq -o foo.group0012.fq.gml
grep ^@ foo.group0012.fq | cut -f2 | sort | uniq

When kevlar find is run in banding mode, some reads will show up multiple times across the various outputs ("interesting" k-mers from each read will be distributed across different bands). Currently, kevlar collect handles:

de-duplicating the reads
re-computing the abundances of the "interesting" k-mers (in some cases the initial abundances are highly inflated)
filtering out k-mers that are in reference genome (occurs sometimes when all of the overlapping reads from the parents are perfect matches and thus discarded, but there are a sufficient number of reads from the proband that have errors elsewhere and are thus retained)
assembling contigs from the vetted novel k-mers
grouping and collapsing contigs

The first 3 filtering steps should be decoupled from the graph / assembly steps in kevlar. Maybe I'll call it kevlar recalibrate, because NGS.

Report kevlar version when running a script

...for reproducibility's sake, yo.

Automated test suite vs analysis

There are a variety of parameters in the variant discovery problem that we should explore at some point.

sequencing error rate
variant type / size
k size (see #20)
read length
memory size / FPR
nucleotide composition

In #16 I started down this road, and began to integrate this into the automated test suite. That was a step in the wrong direction. The test suite should guarantee correctness of the basic behavior and test for bug regressions. The type of exploratory analysis needed here should be handled separately.

Report test coverage

Add timing to log output

time to load each sample into a counttable
time to load all cases/controls
time to do second pass over data
time intervals between debugging updates

Support reading compressed .augfastq files

Already supported for khmer consume_ functions, but not yet for augmented Fastq files.

kevlar-dev / kevlar Goto Github PK

kevlar's Issues

Recommend Projects

Recommend Topics

Recommend Org