
tleonardi / nanocompore


RNA modifications detection from Nanopore dRNA-Seq data

Home Page: https://nanocompore.rna.rocks

License: GNU General Public License v3.0

Language: Python (100%)
Topics: rna-modifications, bioinformatics, nanopore

nanocompore's People

Contributors

a-slide, tleonardi


nanocompore's Issues

Short ref transcript: NanocomporeError: Not enough p-values for a context of order 2

When using a very short reference transcript (e.g. some present in the latest GENCODE GTF annotation) nanocompore fails with:

nanocompore.common.NanocomporeError: Not enough p-values for a context of order 2

if len(pvalues_vector)<(context*3)+3:

In my case, some transcripts had only 8 p-values, below the required 9 (context=2).

Would it be possible to discard the offending reference with a warning, rather than terminating the whole program?
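A sketch of the suggested behaviour (the function name and structure are illustrative, not nanocompore's actual API): apply the same threshold as the reported check, but skip the reference with a warning instead of raising.

```python
import warnings

def combine_context_pvalues(pvalues_vector, context=2):
    """Illustrative sketch: skip references that are too short for the
    sliding context instead of raising NanocomporeError."""
    min_len = (context * 3) + 3  # same threshold as the reported check
    if len(pvalues_vector) < min_len:
        warnings.warn(
            f"Reference too short for context={context} "
            f"({len(pvalues_vector)} < {min_len}); skipping")
        return None  # the caller would drop this reference and continue
    # ... normal context combination would happen here ...
    return pvalues_vector
```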

Improve approximation of dwell time

Using n_events to approximate dwell time is not ideal. It would be better to use the end minus start indices of the raw signal. To obtain those from NanopolishComp, we need to update nanopolish to the latest version and use the --signal-index flag.
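A sketch of the proposed calculation (the default sample rate is an assumption for illustration; the real value should be read from the fast5 metadata):

```python
def dwell_time_seconds(start_idx, end_idx, sample_rate_hz=3012.0):
    """Dwell time from raw-signal start/end indices, as made available
    by `nanopolish eventalign --signal-index`. The default sample rate
    is an assumption, not read from the fast5 metadata."""
    return (end_idx - start_idx) / sample_rate_hz
```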

RuntimeWarning: overflow encountered in true_divide

nanocompore sampcomp --file_list1 nanocompore/BC2_nanopolish_collapsed_reads.tsv --file_list2 nanocompore/BC3_nanopolish_collapsed_reads.tsv --label1 Red --label2 BH --fasta Reference/Mus_musculus.fa --bed Reference/Mus_musculus.bed --outpath /nanocompore/comp --min_coverage 25 --nthreads 10 --max_invalid_kmers_freq 1.0 --comparison_methods GMM,KS,MW --overwrite

Initialise SampComp and checks options
/home/lp471/bin/nanocompore1.0.0b1/lib/python3.6/site-packages/nanocompore/SampComp.py:89: NanocomporeWarning: Only 1 replicate found for condition Red. This is not recomended. The statistics will be calculated with the logit method
  warn(NanocomporeWarning ("Only 1 replicate found for condition {}. This is not recomended. The statistics will be calculated with the logit method".format(cond_lab)))
/home/lp471/bin/nanocompore1.0.0b1/lib/python3.6/site-packages/nanocompore/SampComp.py:89: NanocomporeWarning: Only 1 replicate found for condition BH. This is not recomended. The statistics will be calculated with the logit method
  warn(NanocomporeWarning ("Only 1 replicate found for condition {}. This is not recomended. The statistics will be calculated with the logit method".format(cond_lab)))
Initialise Whitelist and checks options
Read eventalign index files
	References found in index: 2
Filter out references with low coverage
	References remaining after reference coverage filtering: 2
Start data processing
100%|████████████████████████████████████████████████████████████| 2/2 [00:21<00:00,  7.64s/ Processed References]
/home/lp471/bin/nanocompore1.0.0b1/lib/python3.6/site-packages/statsmodels/stats/multitest.py:325: RuntimeWarning: overflow encountered in true_divide
  pvals_corrected_raw = pvals_sorted / ecdffactor

Multidimensional k-means

Originally posted by @a-slide in #9 (comment)

Another option, at least for the k-means strategy, is a multidimensional k-means that combines all adjacent positions at once, instead of clustering position by position and then combining the p-values.
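A sketch of what this could look like with scikit-learn (the feature layout and names are assumptions, not nanocompore's implementation): one k-means fit over the intensity and dwell features of all positions in the window, rather than one fit per position.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_context(intensity, dwell, context=2, n_clusters=2):
    """Illustrative sketch: k-means over a window of adjacent positions
    at once, with 2 features x (2*context+1) positions per read,
    instead of clustering each position independently."""
    w = 2 * context + 1
    X = np.hstack([intensity[:, :w], dwell[:, :w]])  # reads x features
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
```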

Nanopolish discarded reads

Do you know the meaning of the qc fail and could not calibrate fields?
Is it possible that we are losing information in the qc fail group for the Red (treated) samples?

## Ctrl [post-run summary] total reads: 1075, unparseable: 0, qc fail: 91 , could not calibrate: 652 , no alignment: 9, bad fast5: 0
## Red  [post-run summary] total reads: 3118, unparseable: 0, qc fail: 455 , could not calibrate: 1078 , no alignment: 24, bad fast5: 0

It is, however, possible that this is only a problem with extreme datasets such as this one.

Threads over-subscription

It looks like nanocompore sometimes spawns more threads than it should. Starting it with nthreads=4 on the 7SK IVT data starts 16 threads.
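One plausible cause, assumed here rather than confirmed in the issue, is that BLAS/OpenMP pools spawn extra threads inside each worker process; a sketch of capping them before numpy loads:

```python
import os

# Assumption: the extra threads come from per-process BLAS/OpenMP pools.
# Capping them BEFORE importing numpy makes nthreads=4 mean 4 threads.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ[var] = "1"

import numpy as np  # imported only after the caps are in place
```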

Problems with paired_test() p-values

paired_test() in TxComp takes batches of data points, runs the S1 vs S2 test for each batch, and then takes the median of the resulting p-values.
We have to think carefully about this: I am not sure it is an effective way of controlling false positives.
What about getting rid of it and applying p-value correction instead?

SampCompDB.__get_ref_data()

Hey @a-slide ,
I was thinking of getting rid of this function, because its functionality looks like it is already covered by __getitem__.
Is there a reason why I should keep it?

Testing on artificial data

  • Generate an artificial transcriptome containing all possible kmers
  • Benchmark nanocompore on one transcript with modifications at varying percentages
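A minimal sketch of the first step, using a deliberately simple (non-minimal) construction: concatenating every k-mer guarantees each one occurs in the artificial transcript.

```python
from itertools import product

def all_kmers_sequence(k=5, alphabet="ACGU"):
    """Illustrative sketch: an artificial transcript containing every
    possible k-mer, built by plain concatenation. A de Bruijn sequence
    would be shorter, but this is enough for benchmarking."""
    return "".join("".join(kmer) for kmer in product(alphabet, repeat=k))
```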

multipletests and NaN

Commit 4c8a7f0 removed __multipletests_filter_nan() from sampCompDB.py. However, the function is still needed, because certain positions (e.g. near the 5' end) can have coverage lower than the threshold, resulting in NaN p-values.
If statsmodels.stats.multitest.multipletests is given an array containing at least one NaN, it returns an array of all NaNs.
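A sketch of the filtering behaviour the issue describes (the function name and structure are illustrative): correct only the non-NaN p-values, then re-insert NaN at the original positions.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

def multipletests_filter_nan(pvalues, method="fdr_bh"):
    """Illustrative sketch: run multiple-testing correction on the
    non-NaN p-values only, keeping NaN at their original positions."""
    pvalues = np.asarray(pvalues, dtype=float)
    mask = ~np.isnan(pvalues)
    out = np.full_like(pvalues, np.nan)
    if mask.any():  # also avoids errors when every p-value is NaN
        out[mask] = multipletests(pvalues[mask], method=method)[1]
    return out
```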

Correlated pvalue

The p-values for neighbouring positions are highly correlated, which violates the independence assumption of Fisher's method for combining p-values.
Based on the benchmarking done in this paper, Hou's method seems to be the best solution.
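For reference, a sketch of plain Fisher's method, whose independence assumption is the one being violated here (Hou's weighted method, which accounts for the covariance between tests, is not shown):

```python
import numpy as np
from scipy import stats

def fisher_combine(pvalues):
    """Fisher's method: -2 * sum(log p) ~ chi-square with 2k degrees of
    freedom. Valid only for INDEPENDENT p-values; with correlated
    neighbouring positions the combined statistic is inflated."""
    pvalues = np.asarray(pvalues, dtype=float)
    stat = -2.0 * np.log(pvalues).sum()
    return stats.chi2.sf(stat, df=2 * len(pvalues))
```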

k-means method requires very high coverage

I tried the package on a large dataset, and my impression is that the k-means method requires a coverage of at least 75 in both samples to call anything significant. This will be an issue for the large majority of transcripts.
This should be tested further to determine the exact threshold.

Error when running without BED

Traceback (most recent call last):
  File "/home/lp471/bin/nanocompore1.0.0b1/bin/nanocompore", line 11, in <module>
    sys.exit(main())
  File "/home/lp471/bin/nanocompore1.0.0b1/lib/python3.6/site-packages/nanocompore/nanocompore_main.py", line 71, in main
    args.func(args)
  File "/home/lp471/bin/nanocompore1.0.0b1/lib/python3.6/site-packages/nanocompore/nanocompore_main.py", line 124, in sample_compare_main
    sc_out.save_report(output_fn=f"{outpath}/nanocompore_results.txt")
  File "/home/lp471/bin/nanocompore1.0.0b1/lib/python3.6/site-packages/nanocompore/SampCompDB.py", line 268, in save_report
    for record in self.results[['chr', 'pos', 'ref','strand', 'ref_kmer', 'lowCov']+methods].values.tolist():
  File "/home/lp471/bin/nanocompore1.0.0b1/lib/python3.6/site-packages/pandas/core/frame.py", line 2682, in __getitem__
    return self._getitem_array(key)
  File "/home/lp471/bin/nanocompore1.0.0b1/lib/python3.6/site-packages/pandas/core/frame.py", line 2726, in _getitem_array
    indexer = self.loc._convert_to_indexer(key, axis=1)
  File "/home/lp471/bin/nanocompore1.0.0b1/lib/python3.6/site-packages/pandas/core/indexing.py", line 1327, in _convert_to_indexer
    .format(mask=objarr[mask]))
KeyError: "['chr' 'strand'] not in index"
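One possible fix, sketched here with a hypothetical helper rather than the actual save_report code: select only the annotation columns that are actually present in the results DataFrame, so a run without a BED file (hence no 'chr'/'strand') still produces a report.

```python
def report_columns(df, methods):
    """Hypothetical helper: keep only the annotation columns that exist
    in the results DataFrame, so running without a BED file does not
    raise a KeyError on the missing 'chr'/'strand' columns."""
    wanted = ["chr", "pos", "ref", "strand", "ref_kmer", "lowCov"]
    present = [c for c in wanted if c in df.columns]
    return df[present + list(methods)]
```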

Error in SampCompDB when using sequence_context in SampComp

Here is the traceback:

---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-12-d4623718066d> in <module>
      9 
     10 # Run the analysis
---> 11 db = s ()

~/Programming/nanocompore/nanocompore/SampComp.py in __call__(self)
    226 
    227         # Return database wrapper object
--> 228         return SampCompDB(db_fn=self.__output_db_fn, fasta_fn=self.__fasta_fn, bed_fn=self.__bed_fn)
    229 
    230     #~~~~~~~~~~~~~~PRIVATE MULTIPROCESSING METHOD~~~~~~~~~~~~~~#

~/Programming/nanocompore/nanocompore/SampCompDB.py in __init__(self, db_fn, fasta_fn, bed_fn, run_type)
     86         #Create results df with adjusted p-values
     87         if self._pvalue_tests:
---> 88             self.results = self.__calculate_results(adjust=True)
     89 
     90     def __repr__(self):

~/Programming/nanocompore/nanocompore/SampCompDB.py in __calculate_results(self, adjust)
    153         if adjust:
    154             for col in self._pvalue_tests:
--> 155                 df[col] = self.__multipletests_filter_nan(df[col], method="fdr_bh")
    156         return df
    157 

~/Programming/nanocompore/nanocompore/SampCompDB.py in __multipletests_filter_nan(pvalues, method)
    202         """
    203         pvalues_no_nan = [p for p in pvalues if not np.isnan(p)]
--> 204         corrected_p_values = multipletests(pvalues_no_nan, method=method)[1]
    205         for i, p in enumerate(pvalues):
    206             if np.isnan(p):

~/.virtualenvs/Python3.6/lib/python3.6/site-packages/statsmodels/stats/multitest.py in multipletests(pvals, alpha, method, is_sorted, returnsorted)
    142 
    143     ntests = len(pvals)
--> 144     alphacSidak = 1 - np.power((1. - alphaf), 1./ntests)
    145     alphacBonf = alphaf / float(ntests)
    146     if method.lower() in ['b', 'bonf', 'bonferroni']:

ZeroDivisionError: float division by zero

Add tests

Try to get full test coverage for all the core functions.
In particular, pay attention to potential off-by-one errors.

Large nanopolishComp .tsv files. Compressed/binary files?

I am now testing the pipeline on polyA+ RNA runs, with read numbers closer to a real biological situation.

The problem is that, for a 650k-read sample, the program now produces a ~50GB flat text file.
I was wondering how difficult it would be to make nanocompore read a gzipped (if not some other form of binary) file. Otherwise, my impression is that I/O will bottleneck every subsequent handling of the results files and overwhelm the storage very quickly.
Currently, a 3M-read run (which is biologically acceptable, but for sure far from saturation!) produces a total of more than 600GB of files after analysis!

Thank you,
Luca
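A sketch of how transparent gzip support could look (the function name is illustrative), sniffing the two-byte gzip magic number rather than trusting the file extension:

```python
import gzip

def open_eventalign(path):
    """Illustrative sketch: open a nanopolishComp TSV transparently,
    whether plain text or gzip-compressed, by sniffing the gzip
    magic number (0x1f 0x8b) instead of the file extension."""
    with open(path, "rb") as f:
        magic = f.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "r")
```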

Replicates have to be named differently

You cannot have replicates with the same name in the two conditions. My guess is that, when the names clash, only one condition is used, as it raises zero-variance errors.

Doesn't work

eventalign_fn_dict = {
        'Modified': {'rep1':'./sample_files/modified_rep_1.tsv', 'rep2':'./sample_files/modified_rep_2.tsv'},
        'Unmodified': {'rep1':'./sample_files/unmodified_rep_1.tsv', 'rep2':'./sample_files/unmodified_rep_2.tsv'}}

Works fine

eventalign_fn_dict = {
        'Modified': {'rep1':'./sample_files/modified_rep_1.tsv', 'rep2':'./sample_files/modified_rep_2.tsv'},
        'Unmodified': {'rep3':'./sample_files/unmodified_rep_1.tsv', 'rep4':'./sample_files/unmodified_rep_2.tsv'}}

The problem started after the txComp refactor.
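Until the regression is fixed, a sketch of a defensive check (a hypothetical helper, not part of nanocompore) that rejects duplicate replicate labels up front:

```python
def check_unique_replicate_labels(eventalign_fn_dict):
    """Hypothetical workaround: require replicate labels to be unique
    across conditions, failing early with a clear message instead of
    a confusing zero-variance error later."""
    seen = {}
    for cond, reps in eventalign_fn_dict.items():
        for label in reps:
            if label in seen:
                raise ValueError(
                    f"replicate label '{label}' used in both "
                    f"'{seen[label]}' and '{cond}'")
            seen[label] = cond
```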

Replicates

Is the current logistic regression approach the best one?
What about a negative binomial on the cluster counts? For example, a Wald test on a logistic model ~Cluster+Condition vs ~Cluster.

Improve p-value adjustment

For p-value adjustment, we should ignore positions with low coverage in order to increase power. Conceptually, we should implement independent filtering as in this paper.
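A sketch of the idea (names and the coverage threshold are illustrative choices): exclude low-coverage positions before BH correction so they do not count toward the number of tests.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

def adjust_with_filter(pvalues, coverage, min_coverage=30):
    """Illustrative sketch of independent filtering: drop low-coverage
    positions before BH adjustment to increase power; the filtered
    positions get NaN instead of an adjusted p-value."""
    pvalues = np.asarray(pvalues, dtype=float)
    keep = np.asarray(coverage) >= min_coverage
    adjusted = np.full(pvalues.shape, np.nan)
    if keep.any():
        adjusted[keep] = multipletests(pvalues[keep], method="fdr_bh")[1]
    return adjusted
```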

Implement multidimensional clustering + Gaussian mixture model

Instead of using a p-value aggregation method, it would probably be better to use a multidimensional clustering method that takes adjacent positions into account.

In addition, a Gaussian mixture model will almost certainly perform better than k-means, as it does not assume that the clusters are spherical.
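A minimal sketch of the GMM alternative using scikit-learn (not nanocompore's implementation); with covariance_type="full" the clusters need not be spherical:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_cluster(X, n_components=2):
    """Illustrative sketch: Gaussian mixture with full covariance, which
    (unlike k-means) does not force spherical clusters. X is a
    reads x features matrix (e.g. intensity and dwell over adjacent
    positions)."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="full", random_state=0)
    return gmm.fit_predict(X)
```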

Add option to scale coverage

When the coverage of two samples is very different, one of the two lines becomes almost flat (e.g. in 7SK). We should add an option to scale the coverage to the mean.
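A sketch of the proposed scaling (purely illustrative): rescale both coverage tracks to their common mean before plotting, so neither flattens the other.

```python
import numpy as np

def scale_to_mean(cov1, cov2):
    """Illustrative sketch: rescale two coverage tracks so that both
    have the same mean (the average of the two original means)."""
    cov1 = np.asarray(cov1, dtype=float)
    cov2 = np.asarray(cov2, dtype=float)
    target = (cov1.mean() + cov2.mean()) / 2
    return cov1 * target / cov1.mean(), cov2 * target / cov2.mean()
```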

Simplify input datasets in CLI

What do you think of having an input YAML file instead of the 4 options required at the moment?

sample1:
    rep1:   path/to/sample1/rep1/data
    rep2:   path/to/sample1/rep2/data
sample2:
    rep1:   path/to/sample2/rep1/data
    rep2:   path/to/sample2/rep2/data

It would be parsed directly into the structure required by nanocompore.
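A sketch of parsing that layout with PyYAML (the two-condition validation is an assumed requirement, not a confirmed one):

```python
import yaml

def parse_sample_yaml(text):
    """Illustrative sketch: parse the proposed YAML layout straight into
    a {condition: {replicate: path}} dict. The two-condition check is
    an assumption based on SampComp comparing exactly two conditions."""
    cfg = yaml.safe_load(text)
    if not isinstance(cfg, dict) or len(cfg) != 2:
        raise ValueError("expected exactly 2 conditions")
    for cond, reps in cfg.items():
        if not isinstance(reps, dict):
            raise ValueError(f"condition {cond}: expected replicate mapping")
    return cfg
```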
