
tleonardi / nanocompore


RNA modifications detection from Nanopore dRNA-Seq data

Home Page: https://nanocompore.rna.rocks

License: GNU General Public License v3.0

Language: Python (100%)
Topics: rna-modifications, bioinformatics, nanopore

nanocompore's People

Contributors

a-slide, tleonardi


nanocompore's Issues

Short ref transcript: NanocomporeError: Not enough p-values for a context of order 2

When using a very short reference transcript (e.g. some present in the latest GENCODE GTF annotation) nanocompore fails with:

nanocompore.common.NanocomporeError: Not enough p-values for a context of order 2

if len(pvalues_vector)<(context*3)+3:

In my case, some transcripts had only 8 p-values, below the required 9 (context=2).

Would it be possible to discard the offending reference with a warning, rather than terminating the whole program?
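A sketch of the suggested behaviour (the function name and structure are illustrative, not nanocompore's actual API): apply the same threshold as the reported check, but skip the reference with a warning instead of raising.

```python
import warnings

def combine_context_pvalues(pvalues_vector, context=2):
    """Illustrative sketch: skip references that are too short for the
    sliding context instead of raising NanocomporeError."""
    min_len = (context * 3) + 3  # same threshold as the reported check
    if len(pvalues_vector) < min_len:
        warnings.warn(
            f"Reference too short for context={context} "
            f"({len(pvalues_vector)} < {min_len}); skipping")
        return None  # the caller would drop this reference and continue
    # ... normal context combination would happen here ...
    return pvalues_vector
```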

Improve approximation of dwell time

Using n_events to approximate dwell time is not ideal. It would be better to use the end minus start indices of the raw signal. To obtain those from NanopolishComp, we need to update nanopolish to the latest version and use the --signal-index flag.
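A sketch of the proposed calculation (the default sample rate is an assumption for illustration; the real value should be read from the fast5 metadata):

```python
def dwell_time_seconds(start_idx, end_idx, sample_rate_hz=3012.0):
    """Dwell time from raw-signal start/end indices, as made available
    by `nanopolish eventalign --signal-index`. The default sample rate
    is an assumption, not read from the fast5 metadata."""
    return (end_idx - start_idx) / sample_rate_hz
```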

RuntimeWarning: overflow encountered in true_divide

nanocompore sampcomp --file_list1 nanocompore/BC2_nanopolish_collapsed_reads.tsv --file_list2 nanocompore/BC3_nanopolish_collapsed_reads.tsv --label1 Red --label2 BH --fasta Reference/Mus_musculus.fa --bed Reference/Mus_musculus.bed --outpath /nanocompore/comp --min_coverage 25 --nthreads 10 --max_invalid_kmers_freq 1.0 --comparison_methods GMM,KS,MW --overwrite

Initialise SampComp and checks options
/home/lp471/bin/nanocompore1.0.0b1/lib/python3.6/site-packages/nanocompore/SampComp.py:89: NanocomporeWarning: Only 1 replicate found for condition Red. This is not recomended. The statistics will be calculated with the logit method
  warn(NanocomporeWarning ("Only 1 replicate found for condition {}. This is not recomended. The statistics will be calculated with the logit method".format(cond_lab)))
/home/lp471/bin/nanocompore1.0.0b1/lib/python3.6/site-packages/nanocompore/SampComp.py:89: NanocomporeWarning: Only 1 replicate found for condition BH. This is not recomended. The statistics will be calculated with the logit method
  warn(NanocomporeWarning ("Only 1 replicate found for condition {}. This is not recomended. The statistics will be calculated with the logit method".format(cond_lab)))
Initialise Whitelist and checks options
Read eventalign index files
	References found in index: 2
Filter out references with low coverage
	References remaining after reference coverage filtering: 2
Start data processing
100%|████████████████████████████████████████████████████████████| 2/2 [00:21<00:00,  7.64s/ Processed References]
/home/lp471/bin/nanocompore1.0.0b1/lib/python3.6/site-packages/statsmodels/stats/multitest.py:325: RuntimeWarning: overflow encountered in true_divide
  pvals_corrected_raw = pvals_sorted / ecdffactor

Multidimensional k-means

Originally posted by @a-slide in #9 (comment)

Another option, at least for the k-means strategy, is a multidimensional k-means that combines all adjacent positions at once, instead of clustering position by position and then combining the p-values.
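A sketch of what this could look like with scikit-learn (the feature layout and names are assumptions, not nanocompore's implementation): one k-means fit over the intensity and dwell features of all positions in the window, rather than one fit per position.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_context(intensity, dwell, context=2, n_clusters=2):
    """Illustrative sketch: k-means over a window of adjacent positions
    at once, with 2 features x (2*context+1) positions per read,
    instead of clustering each position independently."""
    w = 2 * context + 1
    X = np.hstack([intensity[:, :w], dwell[:, :w]])  # reads x features
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
```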

Nanopolish discarded reads

Do you know the meaning of the qc fail and could not calibrate fields?
Is it possible that we are losing information in the qc fail group for the Red (treated) samples?

## Ctrl [post-run summary] total reads: 1075, unparseable: 0, qc fail: 91 , could not calibrate: 652 , no alignment: 9, bad fast5: 0
## Red  [post-run summary] total reads: 3118, unparseable: 0, qc fail: 455 , could not calibrate: 1078 , no alignment: 24, bad fast5: 0

It is, however, possible that this is only a problem with extreme datasets such as this one.

Threads over-subscription

It looks like nanocompore sometimes spawns more threads than it should. Starting it with nthreads=4 on the 7SK IVT data starts 16 threads.
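One plausible cause, assumed here rather than confirmed in the issue, is that BLAS/OpenMP pools spawn extra threads inside each worker process; a sketch of capping them before numpy loads:

```python
import os

# Assumption: the extra threads come from per-process BLAS/OpenMP pools.
# Capping them BEFORE importing numpy makes nthreads=4 mean 4 threads.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ[var] = "1"

import numpy as np  # imported only after the caps are in place
```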

Problems with paired_test() p-values

paired_test() in TxComp takes batches of data points, runs the S1 vs S2 test for each batch, and then takes the median of the resulting p-values.
We have to think carefully about this: I am not sure it is an effective way of controlling false positives.
What about getting rid of it and applying p-value correction instead?

SampCompDB.__get_ref_data()

Hey @a-slide ,
I was thinking of getting rid of this function, because its functionality looks like it is already covered by __getitem__.
Is there a reason why I should keep it?

Testing on artificial data

  • Generate an artificial transcriptome containing all possible kmers
  • Benchmark nanocompore on one transcript with modifications at varying percentages
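A minimal sketch of the first step, using a deliberately simple (non-minimal) construction: concatenating every k-mer guarantees each one occurs in the artificial transcript.

```python
from itertools import product

def all_kmers_sequence(k=5, alphabet="ACGU"):
    """Illustrative sketch: an artificial transcript containing every
    possible k-mer, built by plain concatenation. A de Bruijn sequence
    would be shorter, but this is enough for benchmarking."""
    return "".join("".join(kmer) for kmer in product(alphabet, repeat=k))
```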

multipletests and NaN

Commit 4c8a7f0 removed __multipletests_filter_nan() from sampCompDB.py. However, the function is still needed, because certain positions (e.g. near the 5' end) can have coverage lower than the threshold, resulting in NaN p-values.
If statsmodels.stats.multitest.multipletests is given an array containing at least one NaN, it returns an array of all NaNs.
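A sketch of the filtering behaviour the issue describes (the function name and structure are illustrative): correct only the non-NaN p-values, then re-insert NaN at the original positions.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

def multipletests_filter_nan(pvalues, method="fdr_bh"):
    """Illustrative sketch: run multiple-testing correction on the
    non-NaN p-values only, keeping NaN at their original positions."""
    pvalues = np.asarray(pvalues, dtype=float)
    mask = ~np.isnan(pvalues)
    out = np.full_like(pvalues, np.nan)
    if mask.any():  # also avoids errors when every p-value is NaN
        out[mask] = multipletests(pvalues[mask], method=method)[1]
    return out
```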

Correlated pvalue

The p-values for neighbouring positions are highly correlated, which violates the independence assumption of Fisher's method for combining p-values.
Based on the benchmarking done in this paper, Hou's method seems to be the best solution.
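For reference, a sketch of plain Fisher's method, whose independence assumption is the one being violated here (Hou's weighted method, which accounts for the covariance between tests, is not shown):

```python
import numpy as np
from scipy import stats

def fisher_combine(pvalues):
    """Fisher's method: -2 * sum(log p) ~ chi-square with 2k degrees of
    freedom. Valid only for INDEPENDENT p-values; with correlated
    neighbouring positions the combined statistic is inflated."""
    pvalues = np.asarray(pvalues, dtype=float)
    stat = -2.0 * np.log(pvalues).sum()
    return stats.chi2.sf(stat, df=2 * len(pvalues))
```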

k-means method requires very high coverage

I tried the package on a large dataset, and my impression is that the k-means method requires a coverage of at least 75 in both samples to call anything significant. This will be an issue for the large majority of transcripts.
This should be tested further to determine the exact threshold.

Error when running without BED

Traceback (most recent call last):
  File "/home/lp471/bin/nanocompore1.0.0b1/bin/nanocompore", line 11, in <module>
    sys.exit(main())
  File "/home/lp471/bin/nanocompore1.0.0b1/lib/python3.6/site-packages/nanocompore/nanocompore_main.py", line 71, in main
    args.func(args)
  File "/home/lp471/bin/nanocompore1.0.0b1/lib/python3.6/site-packages/nanocompore/nanocompore_main.py", line 124, in sample_compare_main
    sc_out.save_report(output_fn=f"{outpath}/nanocompore_results.txt")
  File "/home/lp471/bin/nanocompore1.0.0b1/lib/python3.6/site-packages/nanocompore/SampCompDB.py", line 268, in save_report
    for record in self.results[['chr', 'pos', 'ref','strand', 'ref_kmer', 'lowCov']+methods].values.tolist():
  File "/home/lp471/bin/nanocompore1.0.0b1/lib/python3.6/site-packages/pandas/core/frame.py", line 2682, in __getitem__
    return self._getitem_array(key)
  File "/home/lp471/bin/nanocompore1.0.0b1/lib/python3.6/site-packages/pandas/core/frame.py", line 2726, in _getitem_array
    indexer = self.loc._convert_to_indexer(key, axis=1)
  File "/home/lp471/bin/nanocompore1.0.0b1/lib/python3.6/site-packages/pandas/core/indexing.py", line 1327, in _convert_to_indexer
    .format(mask=objarr[mask]))
KeyError: "['chr' 'strand'] not in index"
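One possible fix, sketched here with a hypothetical helper rather than the actual save_report code: select only the annotation columns that are actually present in the results DataFrame, so a run without a BED file (hence no 'chr'/'strand') still produces a report.

```python
def report_columns(df, methods):
    """Hypothetical helper: keep only the annotation columns that exist
    in the results DataFrame, so running without a BED file does not
    raise a KeyError on the missing 'chr'/'strand' columns."""
    wanted = ["chr", "pos", "ref", "strand", "ref_kmer", "lowCov"]
    present = [c for c in wanted if c in df.columns]
    return df[present + list(methods)]
```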

Error in SampCompDB when using sequence_context in SampComp

Here is the traceback:

---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-12-d4623718066d> in <module>
      9 
     10 # Run the analysis
---> 11 db = s ()

~/Programming/nanocompore/nanocompore/SampComp.py in __call__(self)
    226 
    227         # Return database wrapper object
--> 228         return SampCompDB(db_fn=self.__output_db_fn, fasta_fn=self.__fasta_fn, bed_fn=self.__bed_fn)
    229 
    230     #~~~~~~~~~~~~~~PRIVATE MULTIPROCESSING METHOD~~~~~~~~~~~~~~#

~/Programming/nanocompore/nanocompore/SampCompDB.py in __init__(self, db_fn, fasta_fn, bed_fn, run_type)
     86         #Create results df with adjusted p-values
     87         if self._pvalue_tests:
---> 88             self.results = self.__calculate_results(adjust=True)
     89 
     90     def __repr__(self):

~/Programming/nanocompore/nanocompore/SampCompDB.py in __calculate_results(self, adjust)
    153         if adjust:
    154             for col in self._pvalue_tests:
--> 155                 df[col] = self.__multipletests_filter_nan(df[col], method="fdr_bh")
    156         return df
    157 

~/Programming/nanocompore/nanocompore/SampCompDB.py in __multipletests_filter_nan(pvalues, method)
    202         """
    203         pvalues_no_nan = [p for p in pvalues if not np.isnan(p)]
--> 204         corrected_p_values = multipletests(pvalues_no_nan, method=method)[1]
    205         for i, p in enumerate(pvalues):
    206             if np.isnan(p):

~/.virtualenvs/Python3.6/lib/python3.6/site-packages/statsmodels/stats/multitest.py in multipletests(pvals, alpha, method, is_sorted, returnsorted)
    142 
    143     ntests = len(pvals)
--> 144     alphacSidak = 1 - np.power((1. - alphaf), 1./ntests)
    145     alphacBonf = alphaf / float(ntests)
    146     if method.lower() in ['b', 'bonf', 'bonferroni']:

ZeroDivisionError: float division by zero

Add tests

Try to get full test coverage for all the core functions.
In particular, pay attention to potential off-by-one errors.

Large nanopolishComp .tsv files. Compressed/binary files?

I am now testing the pipeline on polyA+ RNA runs, with read numbers closer to a real biological situation.

The problem is that, for a 650k-read sample, the program now produces a ~50GB flat text file.
I was wondering how difficult it would be to make nanocompore read a gzipped (if not some other form of binary) file. Otherwise, my impression is that I/O will bottleneck every subsequent handling of the results files and overwhelm the storage very quickly.
Currently, a 3M-read run (which is biologically acceptable, but for sure far from saturation!) produces a total of more than 600GB of files after analysis!

Thank you,
Luca
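A sketch of how transparent gzip support could look (the function name is illustrative), sniffing the two-byte gzip magic number rather than trusting the file extension:

```python
import gzip

def open_eventalign(path):
    """Illustrative sketch: open a nanopolishComp TSV transparently,
    whether plain text or gzip-compressed, by sniffing the gzip
    magic number (0x1f 0x8b) instead of the file extension."""
    with open(path, "rb") as f:
        magic = f.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "r")
```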

Replicates have to be named differently

You cannot have replicates with the same name in the two conditions. My guess is that, when the names clash, only one condition is used, as it raises zero-variance errors.

Doesn't work

eventalign_fn_dict = {
        'Modified': {'rep1':'./sample_files/modified_rep_1.tsv', 'rep2':'./sample_files/modified_rep_2.tsv'},
        'Unmodified': {'rep1':'./sample_files/unmodified_rep_1.tsv', 'rep2':'./sample_files/unmodified_rep_2.tsv'}}

Works fine

eventalign_fn_dict = {
        'Modified': {'rep1':'./sample_files/modified_rep_1.tsv', 'rep2':'./sample_files/modified_rep_2.tsv'},
        'Unmodified': {'rep3':'./sample_files/unmodified_rep_1.tsv', 'rep4':'./sample_files/unmodified_rep_2.tsv'}}

The problem started after the txComp refactor.
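Until the regression is fixed, a sketch of a defensive check (a hypothetical helper, not part of nanocompore) that rejects duplicate replicate labels up front:

```python
def check_unique_replicate_labels(eventalign_fn_dict):
    """Hypothetical workaround: require replicate labels to be unique
    across conditions, failing early with a clear message instead of
    a confusing zero-variance error later."""
    seen = {}
    for cond, reps in eventalign_fn_dict.items():
        for label in reps:
            if label in seen:
                raise ValueError(
                    f"replicate label '{label}' used in both "
                    f"'{seen[label]}' and '{cond}'")
            seen[label] = cond
```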

Replicates

Is the current logistic regression approach the best one?
What about a negative binomial on the cluster counts? For example, a Wald test on a logistic model ~Cluster+Condition vs ~Cluster.

Improve p-value adjustment

For p-value adjustment, we should ignore positions with low coverage in order to increase power. Conceptually, we should implement independent filtering as in this paper.
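A sketch of the idea (names and the coverage threshold are illustrative choices): exclude low-coverage positions before BH correction so they do not count toward the number of tests.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

def adjust_with_filter(pvalues, coverage, min_coverage=30):
    """Illustrative sketch of independent filtering: drop low-coverage
    positions before BH adjustment to increase power; the filtered
    positions get NaN instead of an adjusted p-value."""
    pvalues = np.asarray(pvalues, dtype=float)
    keep = np.asarray(coverage) >= min_coverage
    adjusted = np.full(pvalues.shape, np.nan)
    if keep.any():
        adjusted[keep] = multipletests(pvalues[keep], method="fdr_bh")[1]
    return adjusted
```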

Implement multidimensional clustering + Gaussian mixture model

Instead of using a p-value aggregation method, it would probably be better to use a multidimensional clustering method that takes adjacent positions into account.

In addition, a Gaussian mixture model will almost certainly perform better than k-means, as it does not assume that the clusters are spherical.
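A minimal sketch of the GMM alternative using scikit-learn (not nanocompore's implementation); with covariance_type="full" the clusters need not be spherical:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_cluster(X, n_components=2):
    """Illustrative sketch: Gaussian mixture with full covariance, which
    (unlike k-means) does not force spherical clusters. X is a
    reads x features matrix (e.g. intensity and dwell over adjacent
    positions)."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="full", random_state=0)
    return gmm.fit_predict(X)
```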

Add option to scale coverage

When the coverage of two samples is very different, one of the two lines becomes almost flat (e.g. in 7SK). We should add an option to scale the coverage to the mean.
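A sketch of the proposed scaling (purely illustrative): rescale both coverage tracks to their common mean before plotting, so neither flattens the other.

```python
import numpy as np

def scale_to_mean(cov1, cov2):
    """Illustrative sketch: rescale two coverage tracks so that both
    have the same mean (the average of the two original means)."""
    cov1 = np.asarray(cov1, dtype=float)
    cov2 = np.asarray(cov2, dtype=float)
    target = (cov1.mean() + cov2.mean()) / 2
    return cov1 * target / cov1.mean(), cov2 * target / cov2.mean()
```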

Simplify input datasets in CLI

What do you think of having an input YAML file instead of the 4 options required at the moment?

sample1:
    rep1:   path/to/sample1/rep1/data
    rep2:   path/to/sample1/rep2/data
sample2:
    rep1:   path/to/sample2/rep1/data
    rep2:   path/to/sample2/rep2/data

It would be parsed directly into the structure required by nanocompore.
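A sketch of parsing that layout with PyYAML (the two-condition validation is an assumed requirement, not a confirmed one):

```python
import yaml

def parse_sample_yaml(text):
    """Illustrative sketch: parse the proposed YAML layout straight into
    a {condition: {replicate: path}} dict. The two-condition check is
    an assumption based on SampComp comparing exactly two conditions."""
    cfg = yaml.safe_load(text)
    if not isinstance(cfg, dict) or len(cfg) != 2:
        raise ValueError("expected exactly 2 conditions")
    for cond, reps in cfg.items():
        if not isinstance(reps, dict):
            raise ValueError(f"condition {cond}: expected replicate mapping")
    return cfg
```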
