pipeline-nanopore-denovo-isoforms's Introduction

We have a new bioinformatic resource that largely replaces the functionality of this project! See our new repository here: https://github.com/epi2me-labs/wf-isoforms

This repository is now unsupported and we do not recommend its use. Please contact Oxford Nanopore ([email protected]) for help with your application if it is not possible to upgrade to our new resources or if we are missing key features.

Pipeline for de novo clustering of long transcriptomic reads

The first natural step in the de novo analysis of long transcriptomic data in the absence of a reference genome is the clustering of the reads into groups corresponding to gene families. This pipeline performs that task using the isONclust2 tool, which is based on the approach pioneered by isONclust using minimizers and occasional pairwise alignment.

Since isONclust2 is implemented in C++ using efficient data structures, it can distribute computation across multiple cores and machines, and is thus able to cope with large transcriptomic datasets generated using PromethION P24 and P48 flow cells.

The pipeline optionally concatenates the FASTQ files in the MinKNOW/guppy output and can perform trimming and orientation of cDNA reads using pychopper.

The main output of the pipeline is the assignment of read identifiers to gene clusters and the clustered reads grouped into one FASTQ file per gene cluster. This output is fit for downstream, gene level analysis.

Getting Started

Input

  • The input is a file of fastq records or a directory containing fastq files, specified in config.yml.

Output

The main output files generated by the pipeline are under output_directory/final_clusters:

  • batch_info.tsv - information on the final output batch
  • cluster_cons.fq - a fastq file with the cluster representatives (or consensus sequences)
  • cluster_fastq/ - a directory of fastq files (one per gene cluster)
  • clusters_info.tsv - a TSV file with the cluster sizes
  • clusters.tsv - a TSV file with read identifiers assigned to clusters
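
For downstream work, clusters.tsv can be loaded into a mapping from cluster to read identifiers. A minimal sketch, assuming a two-column tab-separated layout of cluster id and read id (check the actual column order against your own output):

```python
from collections import defaultdict

def read_clusters(tsv_path):
    """Group read identifiers by cluster id from a clusters.tsv-style file."""
    clusters = defaultdict(list)
    with open(tsv_path) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip malformed or empty lines
            cluster_id, read_id = fields[0], fields[1]
            clusters[cluster_id].append(read_id)
    return dict(clusters)
```

From such a mapping it is straightforward to, for example, report cluster sizes or select clusters above a minimum depth for gene-level analysis.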

Dependencies

  • miniconda
  • The rest of the dependencies are installed via conda.
  • CPU with AVX/SSE extensions; the workflow has been validated using the GridION device.

Installation

Clone the pipeline and the pipeline toolset by issuing:

git clone --recursive https://github.com/nanoporetech/pipeline-nanopore-denovo-isoforms.git

Install the dependencies using conda into a new environment:

conda env create -f env.yml

Activate the conda environment:

conda activate denovo-isoforms

Usage

Edit config.yml to set the input fastq and parameters, then on a local machine issue:

snakemake -j <num_cores> all
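
The parameters mentioned above are set in config.yml. A hedged example of the relevant settings; the keys shown are those quoted elsewhere on this page, and the exact input key and defaults should be checked against the config.yml shipped with the repository:

```yaml
## Pipeline-specific parameters:
cores: 20

# Process cDNA reads using pychopper, turn off for direct RNA:
run_pychopper: true
# Options passed to pychopper:
pychopper_opts: "-k PCS109"

# Consensus period (-1 means no consensus):
consensus_period: 500
# Minimum consensus sample size:
consensus_minimum: 50
# Maximum consensus sample size:
consensus_maximum: 150
```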

For analysing larger datasets (e.g. from a PromethION flow cell) it is advisable to run the pipeline on an SGE cluster through DRMAA:

snakemake --rerun-incomplete -j 1000 --latency-wait 600 --drmaa-log-dir sge_logs --drmaa ' -P project_name -V -cwd -l h_vmem=200G,mem_free=155G -pe mt 5' all

Results

The evaluation metrics reported (also described in the isONclust paper; a free version is available) are:

Clustering is not a binary classification problem (i.e. a single read is not simply correctly or incorrectly "labelled" by the algorithm given a ground truth). Each read must instead be evaluated in relation to the reads in the same and in other clusters (e.g. which pairs of reads are correctly assigned to the same cluster, and which are erroneously assigned to different clusters?). It follows that common measures such as precision, recall, and F-score cannot be used directly. Homogeneity, completeness, and the V-measure are analogous to the precision, recall, and F-score measures for binary classification problems, but are adapted to clustering problems.

Intuitively, homogeneity (cf. precision) penalizes over-clustering, i.e. wrongly clustering together reads, while completeness (cf. sensitivity) penalizes under-clustering, i.e. mistakenly keeping reads in different clusters. The V-measure is then defined as the harmonic mean of homogeneity and completeness (cf. the F-measure). We also include the commonly used Adjusted Rand Index (ARI). Intuitively, ARI measures the fraction of read pairs correctly clustered, normalized so that a perfect clustering achieves an ARI of 1 and a random cluster assignment achieves an expected ARI of 0. Briefly, both of these clustering quality metrics are derived from computing pairwise correct and incorrect groupings of reads, instead of individually classifying each read as correct or incorrect (as in classification problems).
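
As an illustration of these definitions, homogeneity, completeness, and V-measure can be computed from scratch with the standard entropy formulas (a self-contained sketch on toy labels, not the pipeline's own evaluation script):

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy (in nats) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def cond_entropy(a, b):
    """H(a | b): average entropy of labels `a` within each group of `b`."""
    n = len(a)
    total = 0.0
    for bv in set(b):
        sub = [x for x, y in zip(a, b) if y == bv]
        total += (len(sub) / n) * entropy(sub)
    return total

def v_measure(truth, pred):
    """Return (homogeneity, completeness, V-measure) for two labelings."""
    h_truth, h_pred = entropy(truth), entropy(pred)
    hom = 1.0 if h_truth == 0 else 1.0 - cond_entropy(truth, pred) / h_truth
    com = 1.0 if h_pred == 0 else 1.0 - cond_entropy(pred, truth) / h_pred
    v = 0.0 if hom + com == 0 else 2 * hom * com / (hom + com)
    return hom, com, v
```

For example, splitting one true gene cluster into several predicted clusters leaves homogeneity at 1 but lowers completeness, while merging two genes into one predicted cluster does the opposite.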

Performance on PCS109 SIRV data

The performance on ~19k SIRV E0 reads generated using the PCS109 protocol can be assessed by running the evaluation script:

./run_evaluation.sh

The main results are:

(benchmark results figure: bench_SIRV)

Performance on PCS109 Drosophila melanogaster data

The performance on a D. melanogaster dataset generated using the SQK-PCS109 protocol can be assessed by running the evaluation script:

./run_evaluation_dmel.sh

The main results are:

(benchmark results figure: bench_Dmel)

Acknowledgements

This software was built in collaboration with Kristoffer Sahlin and Paul Medvedev.

Licence and Copyright

(c) 2020 Oxford Nanopore Technologies Ltd.

This Source Code Form is subject to the terms of the Mozilla Public License, v. 2.0. If a copy of the MPL was not distributed with this file, You can obtain one at http://mozilla.org/MPL/2.0/.

FAQs and tips

References and Supporting Information

See the post announcing transcriptomics tools in the Nanopore Community here.

Research Release

Research releases are provided as technology demonstrators to provide early access to features or to stimulate Community development of tools. Support for this software will be minimal and is only provided directly by the developers. Feature requests, improvements, and discussions are welcome and can be implemented by forking and pull requests. However, much as we would like to rectify every issue and piece of feedback users may have, the developers may have limited resources to support this software. Research releases may be unstable and subject to rapid iteration by Oxford Nanopore Technologies.

pipeline-nanopore-denovo-isoforms's People

Contributors

asaont, bsipos, cjw85, ksahlin, sagrudd

pipeline-nanopore-denovo-isoforms's Issues

Error in the clustering step with isONclust2 cluster

Hello,
We are trying to apply this pipeline to our set of reads (approx. 7 million cDNA reads from a transcriptomic experiment).
We successfully reach the clustering step (the batching step works well, producing 16 isONbatch files).
However, after doing this, the script fails with this output error:

Error in rule cluster_job_3:
jobid: 20
output: clusters/isONcluster_3.cer
shell:
isONclust2 cluster -x sahlin -v -Q -l sorted/batches/isONbatch_3.cer -o clusters/isONcluster_3.cer -z; sync
(exited with non-zero exit code)

Then, our computer crashes and, after we stop the process, we obtain the same error for the other batches.
I attach the logfile of the script, and both the Snakefile and configfile.
I hope you can help us.
Thank you in advance.

logfile script.txt
config.txt
Snakefile.txt

Core number in command vs config

Hi,

I wonder how the num_cores value given in the command

snakemake -j <num_cores> all

relates to the built-in value present in the config.yaml file

## Pipeline-specific parameters:
cores: 20

When using 80 threads, should I only adapt the first command or also adapt the yaml?
Should the yaml count be a divisor of the total thread count, e.g. 80 & 20, allowing four 'batches' of 20?
Finally, how many GB of RAM should be present for each declared thread for the pipeline to run optimally?

Thanks in advance

input fasta file

Hi there!

I've performed a hybrid correction on my ONT RNA-seq dataset and I would like to perform the clustering with your pipeline. As input for the clustering, I only have a fasta file containing the corrected reads.
Could I convert the fasta file into a fastq, or could this conversion invalidate the analysis?

Best regards,
Giulia
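
As a general note on the question above: the conversion itself is mechanical, since a FASTQ record is a FASTA record plus a quality string, so one can assign a fixed placeholder quality (here 'I', i.e. Phred 40 in Sanger encoding) to every base. A minimal sketch; whether placeholder qualities bias any quality-aware downstream steps is a separate question:

```python
def fasta_to_fastq(fasta_path, fastq_path, qual_char="I"):
    """Convert FASTA records to FASTQ, giving every base a fixed dummy quality."""
    with open(fasta_path) as fin, open(fastq_path, "w") as fout:
        name, seq = None, []

        def flush():
            # Write the record buffered so far, if any.
            if name is not None:
                s = "".join(seq)
                fout.write(f"@{name}\n{s}\n+\n{qual_char * len(s)}\n")

        for line in fin:
            line = line.rstrip("\n")
            if line.startswith(">"):
                flush()
                # Keep only the first whitespace-separated token as the read name.
                name = line[1:].split()[0] if line[1:].strip() else ""
                seq = []
            elif line:
                seq.append(line)
        flush()
```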

How to set the "consensus_period" parameter?

Hi! Could you share some detailed info about how to set the "consensus_period" parameter in the config.yml file? By default the result does not seem to contain consensuses of the clustered sequences. I want to know how the "best" representative is chosen in the pipeline, and how to compute a consensus, and with what algorithm?

pipeline breaking with KeyError in line 184 of pipeline-nanopore-denovo-isoforms/Snakefile: 'consensus_minimum'

Almost there, then ... crash!
Thanks for any help fixing this.

pipeline-nanopore-denovo-isoforms$ snakemake -j 44 all
Preprocessing read in fastq file: /data/NC_projects_GridION/Flongle/reads.fq
Concatenating reads under directory: /data/NC_projects_GridION/Flongle/reads.fq
Running pychopper of fastq file: processed_reads/input_reads.fq
Using kit: PCS109
Configurations to consider: "+:SSP,-VNP|-:VNP,-SSP"
Counting fastq records in input file: input_reads.fq
Total fastq records in input file: 121793
Tuning the cutoff parameter (q) on 9978 sampled reads (8.2%) passing quality filters (Q >= 7.0).
Optimizing over 30 cutoff values.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [02:32<00:00,  5.10s/it]
Best cutoff (q) value is 2.414 with 11% of the reads classified.
Processing the whole dataset using a batch size of 6089:
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 121793/121793 [00:44<00:00, 2715.70it/s]
Finished processing file: input_reads.fq
Input reads failing mean quality filter (Q < 7.0): 0 (0.00%)
Output fragments failing length filter (length < 50): 0
Detected 2 potential artefactual primer configurations:
Configuration	NrReads	PercentReads
VNP,-VNP	8444	6.93%
SSP,SSP 	6614	5.43%
-----------------------------------
Reads with two primers:	11.17%
Rescued reads:		1.49%
Unusable reads:		87.35%
-----------------------------------
Counting records in input fastq: processed_reads/full_length_reads.fq
Bases in input: 8 megabases
Batch size is: 406
KeyError in line 184 of /opt/biotools/pipeline-nanopore-denovo-isoforms/Snakefile:
'consensus_minimum'
  File "/opt/biotools/pipeline-nanopore-denovo-isoforms/Snakefile", line 184, in <module>

Consensus parameters giving worse sequence

Hello,

When I use pychopper and consensus parameters, the sequence seems worse in all six frames somehow (see below). I understand the cluster sequences can't be compared one-to-one, these are just examples. I find a similar worse sequence pattern in other sequences in the consensus file.

Changed Parameters:
Process cDNA reads using pychopper, turn off for direct RNA:
run_pychopper: true
Options passed to pychopper:
pychopper_opts: "-k PCS109 -r report.pdf -A aln_hits.bed -S statistics.tsv -u unclassified.fq -w rescued.fq"

Consensus period (-1 means no consensus):
consensus_period: 500
Minimum consensus sample size:
consensus_minimum: 50
Maximum consensus sample size:
consensus_maximum: 150

@cluster_0 origin=cons_0_41:1 length=749 size=17279
GGGGTCATACTAAGCTATTCGGCTAGTTTTAATAGTCAACTAACAAATATACGGGACACGGGTATACGGTTAATCATCCTTGGCTAAATCCCCGCTTACAATATCGAGCAAATGTGTAATACATACATATGCATAAAATTATATTTGGATTGTTTGGCGTGACTTTATTAATATATATTAAAATAGTATCACCATTTTGATAAAATTCGTGATTATTTCCGGTTGCTACTATCGTAATTCAAAATGTTTCGCAACCATGTCTAATGCATGTGTATAAAATATTTTGTATATAAAGCGGTATTCTTCTGCTGATGGGGATCAAACCAAATTCATCTGCAAAATGAACTTCTACTATATCCTTCAAGCTATTACCGTCGCTGTGCTCTTCGTAGCTGCAGCCCAGGGCGGTGGTGATGGAAACCCTCCTCCAGCCTAATTCCAGCAATGGGAAATTGTGTTTCAAATATTAATTTTTGACATACAAAATATTAAATACCATTGACAAAATGCAAAATAAATGAGCTCAACTTAACATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACCTCAGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATAAAGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

With Default Settings, I get the below sequence.
Process cDNA reads using pychopper, turn off for direct RNA:
run_pychopper: false
Options passed to pychopper:
pychopper_opts: ""

Consensus period (-1 means no consensus):
consensus_period: -1
Minimum consensus sample size:
consensus_minimum: 50
Maximum consensus sample size:
consensus_maximum: -150

@cluster_0 origin=rep_6_989:1 length=999 size=17636
CCGTGACAAGAAAGTTGTCGGTGTCTTTGTGTTTCTGTTGGTGCTGATATTGCTGGGATCAAACCAAATTCATCTGCAAAATGAACTTCTACTATATCCTTCAAGCTATTACCGTCGCTGTGCTCTTCGTAGCTGCAGCCCAGGGCGGTGGTGATGGAAACCCTCCTCCAGCCTAATTCCAGCAATGGGAAATTGTGTTTCAAATATTAATTTTTGACATACAAAATATTAAATACCATTGACAAAATGCAAAATAAATGAGCTCAACTCAAAAAAAAAAAAAAAAAAAAAAAAGAAGATAGAGCGACAGGCAAGTCACAAAGACACCGACAACTTTCTTGTCGTTTCCAGTATGCTTCGTTCGTTTCAGTGGTGTTTATGATCCATCATCTACCGTGACAAGAGGATTGTCGGTGTCTTTGTGACTTGCCTGTCGCTCTATCTTCTCTTTTTTTAAAAAAACGCAAAAGCCACTTGAAATTTATTATTTCTAATGCATTTAGGGACTGATCTCCGTAGAGACATGATCATCTTCACCTTTGCAAGAGATAGTTATTTTTATTGGTGCTATCTCCGGATAGATAACCACATTGCACTCCTTTCACGCCCGTCATTTCGAAGAATTGCGTCATAGTAGATCCCGTCAAAAAATCAAAACTGCCAATACTAGTCACTTTGACAACAGTGTATTTGGGACCATCGCCAGCAGCCAGTTTGACCAAAGATGTTGTGAACAACTTCATAGCCTCTTCCCTCTTTGCGCCCTCTATTTCTTCCGGTATAATGTCACCGAAAGCCTTAACCGACATCATGACGATAACAGCCACCAACAAAACAATCGATATCCAAGGTGAACATTGTGAGGCAACTCTTCGCAGACGTTTGGGATAACACGATCACAAGGATTTAATAATTCTGAGTTTACCGCACACAGAATTACATATCCCCAGCAATATCAGCACCAACAGAAACACAAAGACACCGACAACTTTCTTGTCA

I'd also like to know, with default settings (consensus_period: -1), what exactly is being output to the cluster_cons.fq file if it is not the consensus of the clusters?

Thanks.

Running error at first time instalation

I tried to run ./run_evaluation.sh but an error occurred during the process:
I changed the input file name, but still had no success running it.

(denovo-isoforms) dedenmatra@dmatra:~/pipeline-nanopore-denovo-isoforms$ ./run_evaluation.sh
SyntaxError:
Input and output files have to be specified as strings or lists of strings.
File "/home/dedenmatra/pipeline-nanopore-denovo-isoforms/Snakefile", line 21, in
File "/home/dedenmatra/pipeline-nanopore-denovo-isoforms/snakelib/utils.snake", line 18, in
[M::mm_idx_gen::0.020*1.07] collected minimizers
[M::mm_idx_gen::0.026*2.64] sorted minimizers
[M::main::0.026*2.64] loaded/built the index for 7 target sequence(s)
[M::mm_mapopt_update::0.029*2.50] mid_occ = 14
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 7
[M::mm_idx_stat::0.030*2.41] distinct minimizers: 40410 (62.16% are singletons); average occurrences: 1.876; average spacing: 2.942; total length: 223019
[M::worker_pipeline::8.012*8.83] mapped 19730 sequences
[M::main] Version: 2.22-r1101
[M::main] CMD: minimap2 -ax splice -t 10 evaluation/data/SIRV_150601a.fasta evaluation/data/SIRV_PCS109_phmm_fl.fq
[M::main] Real time: 8.019 sec; CPU: 70.741 sec; Peak RSS: 0.802 GB
[bam_sort_core] merging from 0 files and 2 in-memory blocks...

Traceback (most recent call last):
File "/home/dedenmatra/pipeline-nanopore-denovo-isoforms/./scripts/compute_cluster_quality.py", line 467, in
main(args)
File "/home/dedenmatra/pipeline-nanopore-denovo-isoforms/./scripts/compute_cluster_quality.py", line 345, in main
clusters = parse_inferred_clusters_tsv(args.clusters, args)
File "/home/dedenmatra/pipeline-nanopore-denovo-isoforms/./scripts/compute_cluster_quality.py", line 18, in parse_inferred_clusters_tsv
infile = open(tsv_file, "r")
FileNotFoundError: [Errno 2] No such file or directory: 'evaluation/pipeline-isONclust2_SIRV_E0/final_clusters/clusters.tsv'

mostly unused reads in cdna_classifier_report.pdf

Unless I misunderstand the report created in 'processed_reads', I have a problem.
The data below comes from an exploratory run using a single flongle. When this run gives good data, a regular flow-cell will be run, but I want to explore the analysis before going there.

Is there something intrinsic that makes the pipeline incompatible with Flongle data (SQK-DCS109)?

batch_info.tsv

Name	Value
BatchNumber	0
BatchStart	0
BatchEnd	17825
Depth	5
NrBases	7607732
NrClusters	883
NrNontrivialCls	650
MinDBsize	0

head of cdna_classifier_report.tsv

Category	Name	Value
ReadStats	PassReads	121497.0
ReadStats	LenFail	0.0
ReadStats	QcFail	296.0
Classification	Primers_found	17826.0
Classification	Rescue	4506.0
Classification	Unusable	101429.0
Strand	+	9047.0
Strand	-	8779.0
RescueStrand	+	2208.0
RescueStrand	-	2298.0
UnclassHitNr	0	61631.0
UnclassHitNr	2	25975.0
UnclassHitNr	3	9456.0
UnclassHitNr	4	2942.0
UnclassHitNr	5	996.0
UnclassHitNr	6	297.0
UnclassHitNr	7	93.0
UnclassHitNr	8	32.0
UnclassHitNr	9	5.0
UnclassHitNr	11	1.0
UnclassHitNr	12	1.0
RescueHitNr	4	1942.0
RescueHitNr	5	1478.0
RescueHitNr	6	712.0
RescueHitNr	7	242.0
RescueHitNr	8	94.0
RescueHitNr	9	26.0
RescueHitNr	10	8.0
RescueHitNr	11	2.0
RescueHitNr	19	2.0
RescueSegmentNr	2	2220.0
RescueSegmentNr	3	22.0

cdna_classifier_report.pdf

I first ran the pipeline on the fastq obtained from a standard GridION run and found 87% of my reads unusable in the pychopper report.

I then read the pychopper page, which suggests that the reads should be basecalled without trimming.

I then re-ran guppy with the following command:

guppy_basecaller -i ./fast5_pass \
  -s ./hacbc_out  \
  -c /opt/ont/guppy/data/dna_r9.4.1_450bps_hac.cfg \
   -x 'cuda:0' \
  --trim_strategy none

I got a bunch of new fastq files which I merged and fed to the pipeline in a new run (all parameters in config.yml unchanged).

I again get 87% unusable reads.

Did I miss something or is my guppy command incomplete?

Thanks for your help

my env if this can help:

# packages in environment at /opt/biotools/miniconda3/envs/pipeline-denovo-isoforms:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
aioeasywebdav             2.4.0           py38h32f6830_1001    conda-forge
aiohttp                   3.7.2            py38h1e0a361_0    conda-forge
amply                     0.1.4                      py_0    conda-forge
appdirs                   1.4.4              pyh9f0ad1d_0    conda-forge
async-timeout             3.0.1                   py_1000    conda-forge
atk                       2.36.0                        3    conda-forge
atk-1.0                   2.36.0               h63f31ab_3    conda-forge
attrs                     20.2.0             pyh9f0ad1d_0    conda-forge
bcrypt                    3.2.0            py38h1e0a361_1    conda-forge
boto3                     1.16.5             pyh9f0ad1d_0    conda-forge
botocore                  1.19.5             pyh9f0ad1d_0    conda-forge
brotlipy                  0.7.0           py38h8df0ef7_1001    conda-forge
bzip2                     1.0.8                h516909a_3    conda-forge
c-ares                    1.16.1               h516909a_3    conda-forge
ca-certificates           2020.6.20            hecda079_0    conda-forge
cachetools                4.1.1                      py_0    conda-forge
cairo                     1.16.0            h488836b_1006    conda-forge
certifi                   2020.6.20        py38h924ce5b_2    conda-forge
cffi                      1.14.3           py38h1bdcb99_1    conda-forge
chardet                   3.0.4           py38h924ce5b_1008    conda-forge
coincbc                   2.10.5               h71b4bd6_1    conda-forge
configargparse            1.2.3              pyh9f0ad1d_0    conda-forge
cryptography              3.1.1            py38hb23e4d4_1    conda-forge
cycler                    0.10.0                     py_2    conda-forge
datrie                    0.8.2            py38h1e0a361_1    conda-forge
dbus                      1.13.6               h7a60e0d_1    conda-forge
decorator                 4.4.2                      py_0    conda-forge
docutils                  0.16             py38h924ce5b_2    conda-forge
drmaa                     0.7.9                   py_1000    conda-forge
dropbox                   10.4.1             pyh9f0ad1d_0    conda-forge
expat                     2.2.9                he1b5a44_2    conda-forge
fftw                      3.3.8           nompi_hdcdd268_1112    conda-forge
filechunkio               1.8                        py_2    conda-forge
font-ttf-dejavu-sans-mono 2.37                 hab24e00_0    conda-forge
font-ttf-inconsolata      2.001                hab24e00_0    conda-forge
font-ttf-source-code-pro  2.030                hab24e00_0    conda-forge
font-ttf-ubuntu           0.83                 hab24e00_0    conda-forge
fontconfig                2.13.1            h1056068_1002    conda-forge
fonts-conda-ecosystem     1                             0    conda-forge
fonts-conda-forge         1                             0    conda-forge
freetype                  2.10.4               he06d7ca_0    conda-forge
fribidi                   1.0.10               h516909a_0    conda-forge
ftputil                   4.0.0                      py_0    conda-forge
gdk-pixbuf                2.38.2               h3f25603_6    conda-forge
gettext                   0.19.8.1          hf34092f_1004    conda-forge
ghostscript               9.53.3               he1b5a44_1    conda-forge
giflib                    5.2.1                h516909a_2    conda-forge
gitdb                     4.0.5                      py_0    conda-forge
gitpython                 3.1.11                     py_0    conda-forge
glib                      2.66.2               he1b5a44_0    conda-forge
gobject-introspection     1.66.1           py38he66682d_2    conda-forge
google-api-core           1.22.4             pyh9f0ad1d_0    conda-forge
google-api-python-client  1.12.5             pyh9f0ad1d_0    conda-forge
google-auth               1.22.0                     py_0    conda-forge
google-auth-httplib2      0.0.4              pyh9f0ad1d_0    conda-forge
google-cloud-core         1.4.3              pyh9f0ad1d_0    conda-forge
google-cloud-storage      1.31.2             pyh9f0ad1d_0    conda-forge
google-crc32c             1.0.0            py38h6d3b9ce_1    conda-forge
google-resumable-media    1.1.0              pyh9f0ad1d_0    conda-forge
googleapis-common-protos  1.52.0           py38h32f6830_0    conda-forge
graphite2                 1.3.13            he1b5a44_1001    conda-forge
graphviz                  2.42.3               h6939c30_2    conda-forge
grpcio                    1.31.0           py38h2c89da0_0    conda-forge
gst-plugins-base          1.14.5               h0935bb2_2    conda-forge
gstreamer                 1.14.5               h36ae1b5_2    conda-forge
gtk2                      2.24.32              h194ddfc_3    conda-forge
gts                       0.7.6                h17b2bb4_1    conda-forge
harfbuzz                  2.7.2                hb1ce69c_1    conda-forge
hmmer                     3.3.1                he1b5a44_0    bioconda
htslib                    1.11                 hd3b49d5_0    bioconda
httplib2                  0.18.1             pyh9f0ad1d_0    conda-forge
icu                       67.1                 he1b5a44_0    conda-forge
idna                      2.10               pyh9f0ad1d_0    conda-forge
imagemagick               7.0.10_28       pl526h201ca68_0    conda-forge
importlib-metadata        2.0.0                      py_1    conda-forge
importlib_metadata        2.0.0                         1    conda-forge
ipython_genutils          0.2.0                      py_1    conda-forge
isonclust2                2.3                  hc9558a2_0    bioconda
jbig                      2.1               h516909a_2002    conda-forge
jinja2                    2.11.2             pyh9f0ad1d_0    conda-forge
jmespath                  0.10.0             pyh9f0ad1d_0    conda-forge
joblib                    0.17.0                     py_0    conda-forge
jpeg                      9d                   h516909a_0    conda-forge
jsonschema                3.2.0                      py_2    conda-forge
jupyter_core              4.6.3            py38h32f6830_2    conda-forge
k8                        0.2.5                he513fc3_0    bioconda
kiwisolver                1.2.0            py38hbf85e49_1    conda-forge
krb5                      1.17.1               hfafb76e_3    conda-forge
lcms2                     2.11                 hbd6801e_0    conda-forge
ld_impl_linux-64          2.35                 h769bd43_9    conda-forge
libblas                   3.9.0                2_openblas    conda-forge
libcblas                  3.9.0                2_openblas    conda-forge
libclang                  10.0.1          default_hde54327_1    conda-forge
libcrc32c                 1.1.1                he1b5a44_2    conda-forge
libcurl                   7.71.1               hcdd3856_8    conda-forge
libdeflate                1.6                  h516909a_0    conda-forge
libedit                   3.1.20191231         he28a2e2_2    conda-forge
libev                     4.33                 h516909a_1    conda-forge
libevent                  2.1.10               hcdb4288_3    conda-forge
libffi                    3.2.1             he1b5a44_1007    conda-forge
libgcc-ng                 9.3.0               h5dbcf3e_17    conda-forge
libgfortran-ng            9.3.0               he4bcb1c_17    conda-forge
libgfortran5              9.3.0               he4bcb1c_17    conda-forge
libglib                   2.66.2               h0dae87d_0    conda-forge
libgomp                   9.3.0               h5dbcf3e_17    conda-forge
libiconv                  1.16                 h516909a_0    conda-forge
liblapack                 3.9.0                2_openblas    conda-forge
libllvm10                 10.0.1               he513fc3_3    conda-forge
libnghttp2                1.41.0               h8cfc5f6_2    conda-forge
libopenblas               0.3.12          pthreads_h4812303_1    conda-forge
libpng                    1.6.37               hed695b0_2    conda-forge
libpq                     12.3                 h5513abc_2    conda-forge
libprotobuf               3.13.0.1             h8b12597_0    conda-forge
librsvg                   2.50.1               h33a7fed_0    conda-forge
libsodium                 1.0.18               h516909a_1    conda-forge
libssh2                   1.9.0                hab1572f_5    conda-forge
libstdcxx-ng              9.3.0               h2ae2ef3_17    conda-forge
libtiff                   4.1.0                hc7e4089_6    conda-forge
libtool                   2.4.6             hebb1f50_1006    conda-forge
libuuid                   2.32.1            h14c3975_1000    conda-forge
libwebp                   1.1.0                h56121f0_4    conda-forge
libwebp-base              1.1.0                h516909a_3    conda-forge
libxcb                    1.13              h14c3975_1002    conda-forge
libxkbcommon              0.10.0               he1b5a44_0    conda-forge
libxml2                   2.9.10               h68273f3_2    conda-forge
lz4-c                     1.9.2                he1b5a44_3    conda-forge
markupsafe                1.1.1            py38h8df0ef7_2    conda-forge
matplotlib                3.3.2            py38h32f6830_1    conda-forge
matplotlib-base           3.3.2            py38h4d1ce4f_1    conda-forge
minimap2                  2.17                 hed695b0_3    bioconda
multidict                 4.7.5            py38h1e0a361_2    conda-forge
mysql-common              8.0.21                        2    conda-forge
mysql-libs                8.0.21               hf3661c5_2    conda-forge
nbformat                  5.0.8                      py_0    conda-forge
ncurses                   6.2                  he1b5a44_2    conda-forge
networkx                  2.5                        py_0    conda-forge
nspr                      4.29                 he1b5a44_1    conda-forge
nss                       3.58                 h27285de_1    conda-forge
numpy                     1.19.2           py38hf89b668_1    conda-forge
oauth2client              4.1.3                      py_0    conda-forge
olefile                   0.46               pyh9f0ad1d_1    conda-forge
openjpeg                  2.3.1                h981e76c_3    conda-forge
openssl                   1.1.1h               h516909a_0    conda-forge
pandas                    1.1.3            py38hddd6c8b_2    conda-forge
pango                     1.42.4               h80147aa_5    conda-forge
paramiko                  2.7.2              pyh9f0ad1d_0    conda-forge
parasail-python           1.2              py38h8162308_2    bioconda
patsy                     0.5.1                      py_0    conda-forge
pcre                      8.44                 he1b5a44_0    conda-forge
perl                      5.26.2            h516909a_1006    conda-forge
pillow                    8.0.1            py38h9776b28_0    conda-forge
pip                       20.2.4                     py_0    conda-forge
pixman                    0.38.0            h516909a_1003    conda-forge
pkg-config                0.29.2            h516909a_1008    conda-forge
prettytable               0.7.2                      py_3    conda-forge
protobuf                  3.13.0.1         py38h950e882_1    conda-forge
psutil                    5.7.3            py38h8df0ef7_0    conda-forge
pthread-stubs             0.4               h14c3975_1001    conda-forge
pulp                      2.3.1            py38h32f6830_0    conda-forge
pyasn1                    0.4.8                      py_0    conda-forge
pyasn1-modules            0.2.7                      py_0    conda-forge
pychopper                 2.5.0                      py_0    bioconda
pycparser                 2.20               pyh9f0ad1d_2    conda-forge
pygments                  2.7.2                      py_0    conda-forge
pygraphviz                1.6              py38h25c7686_1    conda-forge
pynacl                    1.4.0            py38h1e0a361_2    conda-forge
pyopenssl                 19.1.0                     py_1    conda-forge
pyparsing                 2.4.7              pyh9f0ad1d_0    conda-forge
pyqt                      5.12.3           py38ha8c2ead_4    conda-forge
pyqt5-sip                 4.19.18                  pypi_0    pypi
pyqtchart                 5.12                     pypi_0    pypi
pyqtwebengine             5.12.1                   pypi_0    pypi
pyrsistent                0.17.3           py38h1e0a361_1    conda-forge
pysam                     0.16.0.1         py38hbdc2ae9_1    bioconda
pysftp                    0.2.9                      py_1    conda-forge
pysocks                   1.7.1            py38h924ce5b_2    conda-forge
python                    3.8.6           h852b56e_0_cpython    conda-forge
python-dateutil           2.8.1                      py_0    conda-forge
python-edlib              1.3.8.post1      py38hed8969a_2    bioconda
python-irodsclient        0.8.2                      py_0    conda-forge
python_abi                3.8                      1_cp38    conda-forge
pytz                      2020.1             pyh9f0ad1d_0    conda-forge
pyyaml                    5.3.1            py38h8df0ef7_1    conda-forge
qt                        5.12.9               h1f2b2cb_0    conda-forge
ratelimiter               1.2.0                   py_1002    conda-forge
readline                  8.0                  he28a2e2_2    conda-forge
requests                  2.24.0             pyh9f0ad1d_0    conda-forge
rsa                       4.6                pyh9f0ad1d_0    conda-forge
s3transfer                0.3.3                      py_3    conda-forge
samtools                  1.11                 h6270b1f_0    bioconda
scikit-learn              0.23.2           py38h519568a_1    conda-forge
scipy                     1.5.2            py38hd9480d8_2    conda-forge
seaborn                   0.11.0                        0    conda-forge
seaborn-base              0.11.0                     py_0    conda-forge
seqkit                    0.13.2                        0    bioconda
setuptools                49.6.0           py38h924ce5b_2    conda-forge
simplejson                3.17.2           py38h1e0a361_1    conda-forge
six                       1.15.0             pyh9f0ad1d_0    conda-forge
slacker                   0.14.0                     py_0    conda-forge
smmap                     3.0.4              pyh9f0ad1d_0    conda-forge
snakemake                 5.26.1                        1    bioconda
snakemake-minimal         5.26.1                     py_1    bioconda
sqlite                    3.33.0               h4cf870e_1    conda-forge
statsmodels               0.12.0           py38hab2c0dc_1    conda-forge
threadpoolctl             2.1.0              pyh5ca1d4c_0    conda-forge
tk                        8.6.10               hed695b0_1    conda-forge
toposort                  1.5                        py_3    conda-forge
tornado                   6.0.4            py38h1e0a361_2    conda-forge
tqdm                      4.51.0             pyh9f0ad1d_0    conda-forge
traitlets                 5.0.5                      py_0    conda-forge
typing-extensions         3.7.4.3                       0    conda-forge
typing_extensions         3.7.4.3                    py_0    conda-forge
uritemplate               3.0.1                      py_0    conda-forge
urllib3                   1.25.11                    py_0    conda-forge
wheel                     0.35.1             pyh9f0ad1d_0    conda-forge
wrapt                     1.12.1           py38h1e0a361_1    conda-forge
xmlrunner                 1.7.7                      py_0    conda-forge
xorg-kbproto              1.0.7             h14c3975_1002    conda-forge
xorg-libice               1.0.10               h516909a_0    conda-forge
xorg-libsm                1.2.3             h84519dc_1000    conda-forge
xorg-libx11               1.6.12               h516909a_0    conda-forge
xorg-libxau               1.0.9                h14c3975_0    conda-forge
xorg-libxdmcp             1.1.3                h516909a_0    conda-forge
xorg-libxext              1.3.4                h516909a_0    conda-forge
xorg-libxpm               3.5.13               h516909a_0    conda-forge
xorg-libxrender           0.9.10            h516909a_1002    conda-forge
xorg-libxt                1.1.5             h516909a_1003    conda-forge
xorg-renderproto          0.11.1            h14c3975_1002    conda-forge
xorg-xextproto            7.3.0             h14c3975_1002    conda-forge
xorg-xproto               7.0.31            h14c3975_1007    conda-forge
xz                        5.2.5                h516909a_1    conda-forge
yaml                      0.2.5                h516909a_0    conda-forge
yarl                      1.6.2            py38h1e0a361_0    conda-forge
zipp                      3.4.0                      py_0    conda-forge
zlib                      1.2.11            h516909a_1010    conda-forge
zstd                      1.4.5                h6597ccf_2    conda-forge

NameError: name 'directory' is not defined

After installing all modules from the env.yaml, the pipeline stops with this error; I am unsure where it originates.

NameError in line 200 of /pipeline-nanopore-denovo-isoforms/Snakefile:
name 'directory' is not defined
File "pipeline-nanopore-denovo-isoforms/Snakefile", line 200, in
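For context, `directory()` is an output-file marker provided by Snakemake itself (introduced in the 5.x series; treat the exact minimum version as an assumption), so a `NameError` for `directory` at a Snakefile line that declares a directory output usually means the installed Snakemake predates the feature. A minimal, hedged version check:

```python
# Hedged sketch: compare the installed Snakemake version string against the
# release that (per the changelog, as an assumption) added directory() outputs.
def version_tuple(v: str):
    """Turn a dotted version like '5.26.1' into (5, 26, 1) for comparison."""
    return tuple(int(p) for p in v.split("."))

MIN_FOR_DIRECTORY = (5, 2, 0)  # assumed minimum release for directory()

def supports_directory(installed: str) -> bool:
    return version_tuple(installed) >= MIN_FOR_DIRECTORY

print(supports_directory("5.26.1"))  # True
print(supports_directory("4.8.0"))   # False
```

If the check fails, upgrading Snakemake (e.g. via conda) is the likely fix rather than editing the Snakefile.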

Creating a consensus fasta file with unique transcripts

Hello,
Is it possible to create a de novo transcriptome fasta file with unique transcripts, with this pipeline? I would like to use it to further map other samples against this reference transcriptome.
Thanks in advance for your help.

KeyError: consensus_minimum

Hello,

According to the logfile, the config.yml seems to lack a "consensus_minimum" key. Is that really the case? What are the allowed values?
Thanks in advance!
P.S.: attached is the logfile.
slurm-1382776.txt
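A `KeyError` like this surfaces only once the Snakefile first reads the key, deep into the run. A small pre-flight check of the config dictionary would catch it up front; the required-key list below is illustrative, taken from the example config later on this page:

```python
# Hedged sketch: verify that every key the Snakefile reads is present in the
# parsed config before launching the pipeline. Key list is an assumption.
REQUIRED = ["pipeline", "workdir_top", "reads_fastq", "kmer_size",
            "window_size", "consensus_minimum", "consensus_maximum"]

def missing_keys(config: dict) -> list:
    """Return the required keys absent from config, in declaration order."""
    return [k for k in REQUIRED if k not in config]

# Example config missing consensus_minimum, mirroring the reported error:
cfg = {"pipeline": "demo", "workdir_top": "/tmp", "reads_fastq": "r.fq",
       "kmer_size": 11, "window_size": 15, "consensus_maximum": -150}
print(missing_keys(cfg))  # ['consensus_minimum']
```

Adding the missing key (the example config further down uses `consensus_minimum: 50`) resolves this class of error.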

'Workflow' object has no attribute 'overwrite_configfiles'

I edited config.yml to point to my fastq (no other edits):

# cDNA or direct RNA reads in fastq format
reads_fastq: "/data/NC_projects_GridION/Flongle/reads.fq"

config.yml.txt

Then when I run (I have 88 cores) I get:

snakemake -j 88 all
AttributeError in line 12 of /opt/biotools/pipeline-nanopore-denovo-isoforms/Snakefile:
'Workflow' object has no attribute 'overwrite_configfiles'
  File "/opt/biotools/pipeline-nanopore-denovo-isoforms/Snakefile", line 12, in <module>

Same error with snakemake -j 20 all.

Thanks in advance for your help
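`overwrite_configfiles` is an attribute of newer Snakemake `Workflow` objects, so this `AttributeError` typically means the pipeline was written against a newer Snakemake than the one installed. Checking the installed version first avoids the cryptic failure; a hedged sketch using only the standard library:

```python
# Hedged sketch: report the installed version of a package (or None if it is
# not installed) so it can be compared against the pipeline's requirement.
from importlib.metadata import version, PackageNotFoundError

def installed_version(pkg: str):
    """Return the installed version string for pkg, or None if absent."""
    try:
        return version(pkg)
    except PackageNotFoundError:
        return None

print(installed_version("snakemake"))  # e.g. '5.26.1', or None if missing
```

The environment dump at the top of this page pins `snakemake 5.26.1`; recreating the pipeline's conda environment from its env.yaml (rather than relying on a system-wide Snakemake such as the one in /opt/biotools) is the usual remedy.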
Large dataset fails to analyze

Dear,

I have been running smaller subsets of this data without issues until I merged the full data and ran it all together (26293588 reads).

I have run it several times and get the following error regardless of the number of threads I use.

In other pipelines I remember issues with the /tmp partition; could that be the case here? (Can I set some environment variable to write the temporary files to another location to check this?)
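For reference, Python's tempfile module (and many POSIX tools, via mkstemp and friends) honor the TMPDIR environment variable, so pointing it at a roomier partition before launching the pipeline is a quick way to test the /tmp hypothesis. The path below is hypothetical:

```python
import os
import tempfile

# Hedged sketch: redirect temporary files via TMPDIR. tempfile caches the
# temp directory on first use, so reset the cache to re-read the environment.
os.environ["TMPDIR"] = "/data/tmp"   # hypothetical roomy partition
tempfile.tempdir = None              # force tempfile to re-read TMPDIR
print(tempfile.gettempdir())         # '/data/tmp' if that directory exists
```

Note that tempfile falls back to /tmp if the TMPDIR target does not exist, so create the directory first.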

My results partition has space, and in my latest run I used only 20 of my 88 threads (520 GB RAM), yet it still crashed.

Thanks for help

snakemake -j 40 all

> cat 2020-12-15T174130.768880.snakemake.log
CalledProcessError in line 186 of /opt/biotools/pipeline-nanopore-denovo-isoforms/Snakefile:
Command 'set -euo pipefail;  rm -fr clusters sorted
            mkdir -p sorted; isONclust2 sort  --batch-size 860954 --kmer-size 11 --window-size 15 --min-shared 5 --min-qual 7.0                         --mapped-threshold 0.7 --aligned-threshold 0.4 --min-fraction 0.8 --min-prob-no-hits 0.1 -M -1 -P -1 -g 50 -c -150 -F 2  -v -o sorted processed_reads/full_length_reads.fq;
            mkdir -p clusters;' returned non-zero exit status 139.
  File "/opt/biotools/pipeline-nanopore-denovo-isoforms/Snakefile", line 186, in <module>
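A shell exit status of 128+N means the child process died from signal N, so status 139 corresponds to signal 11, i.e. SIGSEGV. A segfault from isONclust2 on a 26-million-read input therefore points at a memory or input-size limit in the C++ tool rather than a configuration error:

```python
import signal

# Hedged note: decode a shell exit status above 128 into the fatal signal.
status = 139
sig = status - 128
print(sig, signal.Signals(sig).name)  # 11 SIGSEGV
```

Splitting the input into smaller batches (e.g. via `batch_size` / `batch_max_seq` in the config) is a plausible workaround while the crash is investigated.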

and my config is

---
## General pipeline parameters:

# Name of the pipeline:
pipeline: "pipe_20201215_final-merged"

# ABSOLUTE path to directory holding the working directory:
workdir_top: "/data/3551_ELacchini_Saponaria_Isoforms/3551_ELacchini_merged_final"

# Repository URL:
repo: "https://github.com/nanoporetech/pipeline-isONclust2.git"

## Pipeline-specific parameters:
cores: 20

# cDNA or direct RNA reads in fastq format
reads_fastq: "/data/3551_ELacchini_Saponaria_Isoforms/3551_ELacchini_merged_final/reads/merged-final.fastq"

# The path above is a directory, find and concatenate fastq files:
concatenate: false

# Process cDNA reads using pychopper, turn off for direct RNA:
run_pychopper: true

# Options passed to pychopper:
pychopper_opts: ""

# Batch size in kilobases (if -1 then it is calculated based on the number of cores and bases):
batch_size: -1

# Maximum sequences per input batch (-1 means no limit):
batch_max_seq: -1

# Clustering mode:
cls_mode: "sahlin"

# Kmer size:
kmer_size: 11

# Window size:
window_size: 15

# Minimum cluster size in the left batch:
min_left_cls: 2

# Consensus period (-1 means no consensus):
consensus_period: -1

# Minimum consensus sample size:
consensus_minimum: 50

# Maximum consensus sample size:
consensus_maximum: -150

# Minimum number of minimizers shared between read and cluster:
min_shared: 5

# Minimum average quality value:
min_qual: 7.0

# Minimum mapped fraction of read to be included in cluster:
mapped_threshold: 0.7

# Minimum aligned fraction of read to be included in cluster:
aligned_threshold: 0.4

# Minimum fraction of minimizers shared compared to best hit, in order to continue mapping:
min_fraction: 0.8

# Minimum probability for i consecutive minimizers to be different between read and representative:
min_prob_no_hits: 0.1
