Giter Site home page Giter Site logo

dinopore's Introduction

DInoPORE: Direct detection of INOsines in native RNA with nanoPORE sequencing

OVERVIEW

DInoPORE is a computational method to detect Adenosine-to-Inosine (A-to-I) editing sites from direct-RNA sequencing data. Editing sites identified will also have their editing rate estimated.

This GitHub repository contains the scripts to run DInoPORE. Users who wish to train their own models using DInoPORE's architecture can also find the scripts for training (see training path section below).

A sample dataset to illustrate how the pipeline runs and the typical runtime for a small sample can be found in DInoPORE's Code Ocean capsule (DOI: 10.24433/CO.2180901.v1). This dataset contains 53,283 direct-RNA-seq reads from Xenopus Laevis stage 9 rep 1. R Markdown files reproducing key results from the DInoPORE manuscript can also be found in the Code Ocean capsule under /data/Rmarkdown.

Listed below are the computational environment and software used. Usage documentation can be found in the last section of this readme.

=====================================================================================

REQUISITE SOFTWARE

Guppy_basecaller 3.2.4 (Please modify S1.Basecall_map_nanopolish.sh to point to it)

Third party software included

  • Graphmap2 0.6.3
  • Sam2tsv (part of Jvarkit) 34d8e7f7
  • Picard 2.21.6

conda env

  • python 3.8.5
  • h5py 2.10.0
  • nanopolish 0.11.1
  • pillow 8.3.1
  • pyyaml 5.4.1
  • requests 2.26.0
  • samtools 1.9
  • scipy 1.7.1

R Packages (R version 4.1)

  • Matrix 1.3.4
  • R.utils 2.11.0
  • Rcpp 1.0.7
  • abind 1.4.5
  • caret 6.0.89
  • data.table 1.14.2
  • doParallel 1.0.16
  • ff 4.0.4
  • foreach 1.5.1
  • keras >= 2.3.0
  • multiROC 1.1.1
  • optparse 1.6.6
  • pacman 0.5.1
  • plyr 1.8.6
  • pracma 2.3.3
  • scales 1.1.1
  • tensorflow >= 2.3.0
  • tidyverse 1.3.1
  • usefun 0.4.8
  • zoo 1.8.9

DInoPORE has been tested on CentOS Linux 7 and Ubuntu 20.04.

=====================================================================================

INSTALLATION - Create conda env

conda create -n myenv python=3.8.5 h5py=2.10.0 nanopolish=0.11.1 pillow=8.3.1 pyyaml=5.4.1 requests=2.26.0 samtools=1.9 scipy=1.7.1

conda activate myenv

=====================================================================================

USAGE

bash mainscript1.sh -e <path/to/exptdir> -r <path/to/ref.fa> -n <num_threads> -g <aggregation_Group>

bash mainscript2.sh -e <path/to/exptdir> -r <path/to/ref.fa> -n <num_threads> -g <aggregation_Group> -c <class_Reference>

Arguments required
	-e Full path to directory containing "fast5" folder. User must have write permission for the parent directory of this directory too.

	-r Full path to reference genome FASTA

	-n Number of threads available to DInoPORE

	-g User-defined group name used to specify runs belonging to same group. Affects aggregation step when aggregating across multiple experiment runs.

	-c Class and edit rate reference containing ground truth information about a coordinates' class and edit rate (see code/misc/Example_classref.tsv for format)

Optional argument
	-d [y/n] Delete base-called fastq files? Default value is n.

E.g. bash mainscript.sh -e /data/xen_s9_r1_50k -r /data/reference/xenLae2.fa -n 15 -g xen50k -c /data/xen_s9_r1_50k/groundtruth_class_regression.tsv

NOTE: mainscript.sh expects to find "fast5" directory within exptdir:

path/to/exptdir
└── fast5

=====================================================================================

DOCUMENTATION

Run Mainscript1.sh (Steps 1 to 3) on individual sequencing runs

(1) Basecall fast5 -> map to genome reference -> run nanopolish to extract signal

Script(s):
S1.Basecall_map_nanopolish.sh (input: $exptdir $ref $numcore)

Output:
${exptdir}/out_fastq_bam/${expt}.combined.fastq
${exptdir}/out_fastq_bam/${expt}.combined.gm2.sam
${exptdir}/out_fastq_bam/${expt}.sorted.bam
${exptdir}/out_nanopolish/$expt.gm2.nanopolish_eventAlignOut.txt

(2) Process data: convert bam file to tsv and combine nanopolish into single signal for each 5-mer of a read

Script(s):
S2.Process_bam_nnpl.sh (input: $exptdir $ref $numcore)
└── s2.Sam2tsv_processtsv.sh
└── s2.Combine_raw_nnpl.sh
	└── s2.Combine_raw_nanopolish.R

Output:
${exptdir}/raw_features/$expt.tsv.txt
${exptdir}/raw_features/$expt.gm2.nanopolish_eventAlignOut_combined.txt

(3) Combine bam-tsv file and combined nanopolish file to generate raw features table

Script(s):
S3.Generate_raw_features.sh (input: $exptdir $numcore $agggrp)
└── S3.Generate_raw_features.sh
	└── s3.Generate_raw_feature_table.R

Output:
$(dirname $exptdir)/aggregate_reads/${expt}.tsv_nnpl_inAE.txt_grp${agggrp}

Run Mainscript2.sh (Steps 4 to 6)

(4) Aggregate features of reads into positions

Script(s):
S4.Aggregate_reads_into_pos.sh (input: $exptdir $numcore $agggrp)
	└── s4.Aggregating_reads_pos.R
	
Output:
$(dirname $exptdir)/matrix_CNN/$agggrp.Agg.morefts.10bin.inML.txt

NOTE: This script aggregates all files in matrix_CNN that were processed with the same user-defined group name (-g) and generates one file per group for downstream processing.

From step 5 onwards, there are 2 paths: training and testing.

(I) Testing path:

We provided 3 trained models for testing. Users can used our models to detect Inosine and quantify editing rate on their own data.

NOTE: Models 1 and 2 was trained using H9 and Xenopus Laevis data. Model 3 was trained using editing sites in H9, Xenopus Laevis, HCT116, Mus musculus (Mouse) and synthetic RNAs.

(5) Transform 1D into 2D data + Label data (class 0, 1 and 2 for unmodified, Inosine and SNP AG)

Script(s):
S5.Transform_dim.sh (input:  $exptdir $numcore $agggrp $classref)
	└── s5.Preprocess_data_matrix_inputCNN.R
	
Output:
$(dirname $exptdir)/matrix_CNN/$agggrp.morefts.input_CNN_regression_modgen.RData

(6) Predict testing data using Dinopore models + Plot ROC and PR curves for the result

Script(s):
S6.Predict.sh (input: $exptdir $agggrp $numcore $classref)
	└── s6.Predict_test_data.R

Output:
$(dirname $exptdir)/matrix_CNN/$agggrp.output_prediction_CNN_class0.txt
$(dirname $exptdir)/matrix_CNN/$agggrp.output_prediction_CNN_class1.txt
$(dirname $exptdir)/matrix_CNN/$agggrp.output_prediction_CNN_class2.txt

(II) Training path:

Users can train models using their own data, based on our models' architecture. Depending on the dataset size, number of epochs, batch size and GPU type, time required for training will vary.

For our data: model 3 was trained using 265,631 positions with 893,865 matrices, on GPU NVIDIA GeForce RTX3080 with batch size = 1024. Time for 1 epoch is 88 seconds and there were 900 epochs.

NOTE 1: User should make sure the training data set is balance between 3 classes, especially class 0 (unmod) and class 1 (Inosine or other modifications). Number of matrices for training set should be at least 50,000 for model 1 and model 2 and at least 500,000 for model 3

NOTE 2: Ground truth for training path must have 5 columns: contig, position, strand (p for positive and n for negative), edit (0,1 and 2), and rate (editing rate 0-1 for class 1 and -1 for class 0 and 2) (For example, see Dinopore/code/misc/Example_ground_truth_training.txt)

NOTE 3: User should make sure number of epochs is at least 500 epochs for model 1 and model 2, and at least 900 epochs for model 3

NOTE 4: Scripts for training uses GPU. If users wish to use CPU for training, please remove the following lines of code from the respective scripts: s6a.Training_classification_model_3class.R (lines 58-59), s6b.Training_classification_model_2class.R (lines 58-59), s6c.Training_regression_model.R (lines 74-75).

NOTE 5: Training scripts should be run in the same directory as the input file (usually matrix_CNN). The output file will be generated in the same directory.

(5) Transform 1D into 2D data + Label data (class 0, 1 and 2 for unmodified, Inosine and SNP AG) + Split into training and validation/testing data set

Script(s):
Rscript s5.Train_preprocess_data_matrix_inputCNN_train_val.R -t $numcore -i $input -o $output -c $classref
where
	input=${groupName}.Agg.morefts.10bin.inML.txt
	output=${groupName}
	classref="groundtruth_training_file" (see NOTE 2 above)
	
Output:
${groupName}.validation_matrix.rds (used as $vali in step 6a-c)
${groupName}.training_matrix.rds (used as $train in step 6a-c)

(6a) Train 3-class classification model

Script(s):
Rscript s6a.Training_classification_model_3class.R -v $vali -t $train -o $groupName -e $epoch -b $batch -s $seed
where
	vali=${groupName}.validation_matrix.rds
	train=${groupName}.training_matrix.rds
	
Output:
${groupName}.best_pos5_mix_3class_resnet.h5

(6b) Train 2-class classification model

Script(s):
Rscript s6b.Training_classification_model_2class.R  -v $vali -t $train -o $groupName -e $epoch -b $batch -s $seed
	
Output:
${groupName}.best_pos5_mix_3c_1vs1_resnet.h5

(6c) Train regression model

Script(s):
Rscript s6c.Training_regression_model.R -v $vali -t $train -o $groupName -e $epoch -b $batch -s $seed
	
Output:
${groupName}.best_regression_morefts_16384_1024_asin06.h5

(7) Predict testing data using trained models in step 6a, 6b and 6c

Script(s):
Rscript s7.Predict_test_data_using_trained_models.R -i $input -t  $numcore -m $model1 -M $model2 -r $model3
where
	input=${groupName}.validation_matrix.rds
	model1=${groupName}.best_pos5_mix_3class_resnet.h5
	model2=${groupName}.best_pos5_mix_3c_1vs1_resnet.h5
	model3=${groupName}.best_regression_morefts_16384_1024_asin06.h5

Output:
${groupName}.output_prediction_CNN_class0.txt
${groupName}.output_prediction_CNN_class1.txt
${groupName}.output_prediction_CNN_class2.txt

dinopore's People

Contributors

darelab2014 avatar hengjwj avatar tramanh183 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

dinopore's Issues

confused with event_length * 3012

Hi @darelab2014,
Firstly thanks for developing such an awesome tool. I have been using nanopolish eventalign for most of my work with Nanopore data.
Recently I read Dinopore source code from s2.Combine_raw_nanopolish.R script, I was confused that why event_length multiped by 3012? In my understand, the event_length is dwelling time of 5-mer stay in pore. Can you give me some explained?

  dat=dat[dat$strand=="t" & dat$reference_kmer != "NNNNN" & dat$event_stdv<50,]
  dat[,"count"]=round(3012*dat$event_length)

Thanks a lot,
Bai

classref file

Hi,
Thanks for developing Dinopore. I want to use Dinopore to quantify the A-to-I editing events in my nanopore data, so I followed the steps to process the data. In step 5 and step 6 of testing path, classref file is required, and I don't know what this classref file is . In the training path step5, note 2 says classref file is the groundtruth_training_file. I want to use Dinopore to predict the edit rate of my data, so I don't have the ground truth file. How could I use Dinopore? Thanks.

Best regards,
Chujie

How can I create my own groundtruth file

Hello,

I've been trying to use dinopore from a while. I want to identify A-to-I RNA editing sites from my direct RNA reads.

In the Script #5 one of the parameters is the classref file, which is the groundtruth_class_regression.tsv. However, is this file only meant to be used for the test data set? or should I used it also when running my own data?

If not, how can I generate my own groundtruth_class_regression.tsv file?

Raw data not available on SRA

Dear authors,

I was hoping to re-basecall and analyse the read data in Dinopore. However, the SRP record SRP363295 as provided in the paper has many empty entries, e.g. the mouse and frog ones.

Could you please take a look into this? Thanks for the great work!

Error in py_call_impl when executing s6.Predict_test_data.R

Hello,
I was able to run Dinopore up to the script#6 in accordance with your instructions.When my data reaches s6.Predict_test_data.R
in manuscript2.sh I'm getting this error and the process gets killed:

Error in py_call_impl(callable, call_args$unnamed, call_args$named) : 
  TypeError: Error when deserializing class 'Add' using config={'name': 'add_30', 'trainable': None, 'dtype': 'float32'}.
Exception encountered: Expected `trainable` argument to be a boolean, but got: None
Run `reticulate::py_last_error()` for details.
Calls: load_model_hdf5 ... load_model -> do.call -> <Anonymous> -> py_call_impl
Execution halted

This is my command line code:

bash /nfs_data/liumy/01.project_sg-nex/results/dinoporetest_1/Dinopore-main/code/mainscript2.sh -e  /nfs_data/liumy/01.project_sg-nex/results/dinoporetest_1/ -r /nfs_data/liumy/01.project_sg-nex/results/isoquanttest/3bamsresult/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa -n 32 -g fast5_pass -c /nfs_data/liumy/01.project_sg-nex/results/dinoporetest_1/Dinopore-main/code/misc/Example_classref.tsv

Thank you

Error in py_call_impl S6

Hello,

I was able to run Dinopore up to the script#6. However, when my data reaches s6.Predict_test_data.R

I'm getting this error and the process gets killed:

Error in py_call_impl(callable, dots$args, dots$keywords) :
ValueError: Expect x to be a non-empty array or dataset.

Detailed traceback:
File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1644, in predict
raise ValueError('Expect x to be a non-empty array or dataset.')
Calls: predict ... predict.keras.engine.training.Model -> do.call -> -> py_call_impl
Execution halted

Please if someone can help me troubleshooting I'll appreciate it.

Thank you
Camila

Missing groundtruth classification file

Hi, the usage of the program requires a -c argument with the ground truth classification. The example for running mainscript1.sh shows

-c /data/xen_s9_r1_50k/groundtruth_classification.txt

I haven't been able to find this file anywhere, including in the CloudOcean capsule. Can you provide this file?

ground truth file

Hi, thanks for developing Dinopore. I want the train the Dinopore model from scratch. And I found the groundtruth_class_regression.tsv and the groundtruth_class_regression_all.tsv in codeocean. And these two ground truth file only contain 3 columns, id, ref, rate, not the 5 columns mentioned in the README. Example file in misc on github is correct. Could you provide the whole ground truth file? Thanks!

other rawdata of dRNA sequencing for synthetic RNA

Hi developer! I only found rawdata of Direct RNA Nanopore sequencing of synthetic RNA for only pureI and pureG. However, in your paper, there are other ratio shown in Fig 1e. Have your ever upload these data elsewhere? I am intersect in your work, and want to re-analyze the rawdata. Any help will be greatly appreciate!

image

S4 memory requriements?

How much memory do I need to run the S4 R script with a 70GB input file? The reading in of the file seems to be running out my RAM, or the foreach/doParallel stuff. Not sure really.

Noah

Where are the models?

Hi,
I'm at step 6 but where are the pretrained models? There is no models folder in the repo...

Noah

EDIT: found them in the capsule

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.