nanoporetech / pipeline-transcriptome-de

Pipeline for differential gene expression (DGE) and differential transcript usage (DTU) analysis using long reads

License: Other

Python 50.64% R 49.36%

rna-seq cdna differential-gene-expression differential-transcript-usage nanopore

pipeline-transcriptome-de's Introduction

This project is deprecated. Please see our newer wf-transcriptomes, which contains functionality for differential expression.

Pipeline for differential gene expression (DGE) and differential transcript usage (DTU) analysis using long reads

This pipeline uses snakemake, minimap2, salmon, edgeR, DEXSeq and stageR to automate simple differential gene expression and differential transcript usage workflows on long read data.

If you have paired samples (e.g. treated and untreated samples from the same individuals), use the paired_dge_dtu branch.

Getting Started

Input

The input files and parameters are specified in config.yml:

transcriptome - the input transcriptome.
annotation - the input annotation in GFF format.
control_samples - a dictionary with control sample names and paths to the fastq files.
treated_samples - a dictionary with treated sample names and paths to the fastq files.
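
The four keys listed above can be checked before launching the workflow. The following is a minimal sketch and not part of the pipeline: `check_config` and `REQUIRED_KEYS` are hypothetical names, and the sketch assumes the configuration has already been loaded into a plain dictionary (for example with a YAML parser).

```python
REQUIRED_KEYS = ("transcriptome", "annotation", "control_samples", "treated_samples")


def check_config(cfg):
    """Return a list of human-readable problems found in a loaded config dict."""
    problems = []
    # Every required key must be present and non-empty.
    for key in REQUIRED_KEYS:
        if key not in cfg or cfg[key] in (None, "", {}):
            problems.append(f"missing or empty key: {key}")
    # Sample groups must be dictionaries mapping sample name -> fastq path.
    for group in ("control_samples", "treated_samples"):
        samples = cfg.get(group)
        if samples is None:
            continue
        if not isinstance(samples, dict):
            problems.append(f"{group} must map sample names to fastq paths")
            continue
        for name, fastq in samples.items():
            if not isinstance(fastq, str) or not fastq:
                problems.append(f"{group}.{name} has no fastq path")
    return problems
```

An empty list means the basic shape of the configuration is as described; it says nothing about whether the files themselves exist.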

Output

alignments/*.bam - unsorted transcriptome alignments (input to salmon).
alignments_sorted/*.bam - sorted and indexed transcriptome alignments.
counts - counts generated by salmon.
merged/all_counts.tsv - the transcript count table including all samples.
merged/all_counts_filtered.tsv - the transcript count table including all samples after filtering.
merged/all_gene_counts.tsv - the gene count table including all samples.
de_analysis/coldata.tsv - the condition table used to build model matrix.
de_analysis/de_params.tsv - analysis parameters generated from config.yml.
de_analysis/results_dge.tsv and de_analysis/results_dge.pdf - results of edgeR differential gene expression analysis.
de_analysis/results_dtu_gene.tsv, de_analysis/results_dtu_transcript.tsv and de_analysis/results_dtu.pdf - results of differential transcript usage by DEXSeq.
de_analysis/results_dtu_stageR.tsv - results of the stageR analysis of the DEXSeq output.
de_analysis/dtu_plots.pdf - DTU results plot based on the stageR results and filtered counts.

Dependencies

miniconda - install it according to the instructions.
snakemake - install using conda.
pandas - install using conda.
The rest of the dependencies are automatically installed using the conda feature of snakemake.

Layout

README.md
Snakefile - master snakefile
config.yml - YAML configuration file
snakelib/ - snakefiles collection included by the master snakefile
lib/ - python files included by analysis scripts and snakefiles
scripts/ - analysis scripts
data/ - input data needed by pipeline - use with caution to avoid bloated repo
results/ - pipeline results to be committed - use with caution to avoid bloated repo

Installation

Clone the repository:

git clone https://github.com/nanoporetech/pipeline-transcriptome-de.git

Usage

Edit config.yml to set the input datasets and parameters then issue:

snakemake --use-conda -j <num_cores> all

Help

Licence and Copyright

This Source Code Form is subject to the terms of the Mozilla Public License, v. 2.0. If a copy of the MPL was not distributed with this file, You can obtain one at http://mozilla.org/MPL/2.0/.

References and Supporting Information

This pipeline is largely based on the approach described in the following paper:

Love MI, Soneson C and Patro R. Swimming downstream: statistical analysis of differential transcript usage following Salmon quantification. F1000Research 2018, 7:952 (doi: 10.12688/f1000research.15398.3)

See the post announcing the tool at the Oxford Nanopore Technologies community here.

pipeline-transcriptome-de's Issues

DTU transcript.tsv contains adjusted p.values?

Does the output DTU transcript.tsv file contain p-values or adjusted p-values? Is there a standard threshold to apply for statistically significant DTU entries?
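
No answer to this question is recorded here. For reference, the difference between raw and adjusted p-values can be illustrated with the Benjamini-Hochberg procedure, a widely used multiple-testing correction. This is a sketch only: `bh_adjust` is a hypothetical helper, the input values are made up, and neither the adjustment method nor the 0.05 cut-off shown is confirmed as what the pipeline applies.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]          # made-up raw p-values
adj = bh_adjust(raw)
significant = [p <= 0.05 for p in adj]   # illustrative threshold only
```

Adjusted values are never smaller than the raw ones, so comparing a column against its raw counterpart, where both are available, is one way to tell which is which.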

A question about setting of config.yml

Hi Botond,

I ran into trouble when I tried to run the Snakefile like this:
$ snakemake --use-conda -j 4 all
Error Report:

 Building DAG of jobs...
 MissingInputException in line 23 of /home/weir/software/pipeline-transcriptome-de/Snakefile:
 Missing input files for rule build_minimap_index:
 Araport11_genes.201606.cdna.fasta

Attached my config.yml

---
## General pipeline parameters:

# Name of the pipeline:
pipeline: "Nanopore-DES-Analysis"
# ABSOLUTE path to directory holding the working directory:
workdir_top: "/home/weir/output/"
# Results directory:
resdir: "/home/weir/output/Nanopore-DEG-Analysis/result_DEG"
# Repository URL:
repo: "https://github.com/nanoporetech/pipeline-transcriptome-de"

## Pipeline-specific parameters:

transcriptome: "Araport11_genes.201606.cdna.fasta"
annotation: "Araport11_GFF3_genes_transposons.gtf"

control_samples:
    Col0_1: "col0_1.fastq"
    Col0_2: "col0_2.fastq"
    Col0_3: "col0_3.fastq"

treated_samples:
    fip37_1: "fip37_1.fastq"
    fip37_2: "fip37_2.fastq"
    fip37_3: "fip37_3.fastq"

minimap_index_opts: ""

minimap2_opts: "-ax splice -uf -k14 ref.fa direct-rna.fq"

maximum_secondary: 100

secondary_score_ratio: 1.0

salmon_libtype: "SF"

# Count filtering options - customize these according to your experimental design:

# Genes expressed in minimum this many samples:
min_samps_gene_expr: 3
# Transcripts expressed in minimum this many samples:
min_samps_feature_expr: 1
# Minimum gene counts:
min_gene_expr: 10
# Minimum transcript counts:
min_feature_expr: 3


threads: 8

The transcripts and annotations were downloaded from https://www.arabidopsis.org/
minimap2 --version: 2.15-r905
snakemake was installed with conda (4.6.7).
All my input files exist in /home/weir/output/Nanopore-DEG-Analysis/
In addition, I successfully indexed ref.fa manually with the minimap2 command: minimap2 -d ref.mmi ref.fa
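
Snakemake looks up relative input paths in its working directory, so a MissingInputException for a bare file name can mean the file sits somewhere else. The sketch below lists configured inputs that are not found under a given directory. `missing_inputs` is a hypothetical helper, and which directory the pipeline actually uses as its working directory is an assumption to verify against the Snakefile. Note that the pasted config names the pipeline "Nanopore-DES-Analysis" while the input files are reported to be in a directory called "Nanopore-DEG-Analysis"; whether that difference matters depends on how the working directory is built.

```python
import os


def missing_inputs(cfg, workdir):
    """List configured input files that do not exist relative to workdir."""
    paths = [cfg["transcriptome"], cfg["annotation"]]
    paths += list(cfg.get("control_samples", {}).values())
    paths += list(cfg.get("treated_samples", {}).values())
    missing = []
    for p in paths:
        # Absolute paths are used as-is; relative ones are joined to workdir.
        full = p if os.path.isabs(p) else os.path.join(workdir, p)
        if not os.path.exists(full):
            missing.append(full)
    return missing
```

Using absolute paths in config.yml removes the dependence on the working directory altogether.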

What is different?

What is different between this version and the default pipeline for non-paired samples?

I am comparing littermates of an inbred mouse strain and need to decide on the most appropriate method.

Thank you,
Peter

Snakemake fails

Dear all,

Please let me know whether the pipeline and all listed dependencies (including Snakemake) can be installed on Windows 10, or whether it is for use only on Linux.
Should we use Python 2.7 or Python 3.7?

All the best,

Stan

Running the pipeline with only two samples

I am running the pipeline with only two samples (one per condition) but when I get to the R script-part of the Snakefile I get the following error (probably since R is missing additional inputs from the samples I have removed):

"Finished job 2.
9 of 12 steps (75%) done
Loading counts, conditions and parameters.
Loading annotation database.
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .get_cds_IDX(type, phase) :
The "phase" metadata column contains non-NA values for features of type
stop_codon. This information was ignored.
'select()' returned 1:many mapping between keys and columns
Filtering counts using DRIMSeq.
Building model matrix.
Sum transcript counts into gene counts.
Warning message:
funs() is soft deprecated as of dplyr 0.8.0
please use list() instead

Before:

funs(name = f(.))

After:

list(name = ~ f(.))
This warning is displayed once per session.
Running differential gene expression analysis using edgeR.
Warning message:
In estimateDisp.default(y = y$counts, design = design, group = group, :
No residual df: setting dispersion to NA
Error in glmFit.default(y, design = design, dispersion = dispersion, offset = offset, :
dispersion must be numeric
Calls: glmQLFit ... glmQLFit -> glmQLFit.default -> glmFit -> glmFit.default
Execution halted
Error in job de_analysis while creating output files de_analysis/results_dge.tsv, de_analysis/results_dge.pdf, de_analysis/results_dtu_gene.tsv, de_analysis/results_dtu_transcript.tsv, de_analysis/results_dtu_stageR.tsv, merged/all_counts_filtered.tsv, merged/all_gene_counts.tsv.
RuleException:
CalledProcessError in line 137 of /home/nanopore/tests/3/Snakefile:
Command '
/home/nanopore/tests/3/scripts/de_analysis.R
' returned non-zero exit status 1.
File "/home/nanopore/tests/3/Snakefile", line 137, in __rule_de_analysis
File "/home/nanopore/src/miniconda3/envs/pipeline2/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Removing output files of failed job de_analysis since they might be corrupted:
merged/all_counts_filtered.tsv, merged/all_gene_counts.tsv
Will exit after finishing currently running jobs.
Exiting because a job execution failed. Look above for error message"

Is there any way to adjust the R script to take into account that I am running with two samples, one per group (control/treated)?
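
The "No residual df" warning in the log follows from simple counting: a design with an intercept and a condition term has two coefficients, and with one sample per condition there are only two samples, leaving nothing from which to estimate dispersion. The sketch below does that arithmetic. `residual_df` is a hypothetical helper and assumes a plain two-group design with no additional covariates.

```python
def residual_df(conditions):
    """Residual degrees of freedom for an intercept + condition design."""
    n_samples = len(conditions)
    n_coefficients = len(set(conditions))  # one coefficient per condition level
    return n_samples - n_coefficients


print(residual_df(["untreated", "treated"]))              # one per group   -> 0
print(residual_df(["untreated"] * 3 + ["treated"] * 3))   # three per group -> 4
```

With zero residual degrees of freedom the dispersion cannot be estimated from the data, which matches the "dispersion must be numeric" failure that follows the warning.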

Fail at write.xlsx for raw salmon counts

Hi there, after running the Rmarkdown slave, the execution was halted when running this chunk:

label: geneCounts (with options)
List of 1
 $ echo: logi FALSE

Quitting from lines 246-257 (Nanopore_Transcriptome_Tutorial.Rmd)
Error in .jcheck() : Java Exception <no description because toString() failed>
.jcall(cell, "V", "setCellValue", value)
new("jobjRef", jobj = <pointer: 0x5571a47861f8>, jclass = "java/lang/Throwable")
Calls: <Anonymous> ... .write_block -> mapply -> <Anonymous> -> .jcall -> .jcheck

I opened the Rmarkdown in Rstudio session and ran each chunk, and again it seems to fail exporting the salmon table out to xlsx.

Is this related to the memory that Java uses? The file seems to be too large to export.

This is happening on my MacBook Pro (16 GB) and on a server.

Is there some upper limit when using write.xlsx? The salmon counts table is quite large.

de_analysis error: "Error in `contrasts<-`(`tmp`, ..."

Dear all, dear @bsipos ,

Please help me to understand how to overcome the following error:

rule de_analysis:
    input: de_analysis/de_params.tsv, de_analysis/coldata.tsv, merged/all_counts.tsv
    output: de_analysis/results_dge.tsv, de_analysis/results_dge.pdf, de_analysis/results_dtu_gene.tsv, de_analysis/results_dtu_transcript.tsv, de_analysis/results_dtu_stageR.tsv, merged/all_counts_filtered.tsv, merged/all_gene_counts.tsv
    jobid: 17

Activating conda environment: /media/localarchive/pipeline-transcriptome-de/Workspaces/pipeline-transcriptome-de/.snakemake/conda/91bf797a
Warning message:
package ‘DRIMSeq’ was built under R version 4.0.3
Warning messages:
1: package ‘GenomicFeatures’ was built under R version 4.0.3
2: package ‘S4Vectors’ was built under R version 4.0.3
3: package ‘IRanges’ was built under R version 4.0.3
4: package ‘GenomeInfoDb’ was built under R version 4.0.3
5: package ‘GenomicRanges’ was built under R version 4.0.3
6: package ‘AnnotationDbi’ was built under R version 4.0.3
7: package ‘Biobase’ was built under R version 4.0.3
Loading counts, conditions and parameters.
Loading annotation database.
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of type
  stop_codon. This information was ignored.
'select()' returned 1:many mapping between keys and columns
Filtering counts using DRIMSeq.
Building model matrix.
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
  contrasts can be applied only to factors with 2 or more levels
Calls: model.matrix -> model.matrix.default -> contrasts<-
Execution halted
[Tue Nov 24 17:56:17 2020]
Error in rule de_analysis:
    jobid: 17
    output: de_analysis/results_dge.tsv, de_analysis/results_dge.pdf, de_analysis/results_dtu_gene.tsv, de_analysis/results_dtu_transcript.tsv, de_analysis/results_dtu_stageR.tsv, merged/all_counts_filtered.tsv, merged/all_gene_counts.tsv
    conda-env: /media/localarchive/pipeline-transcriptome-de/Workspaces/pipeline-transcriptome-de/.snakemake/conda/91bf797a
    shell:

    /media/localarchive/pipeline-transcriptome-de/scripts/de_analysis.R

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

I know that is quite similar to #15 but unfortunately it still doesn't help to solve this problem.
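
The R error states that a factor in the model has fewer than two levels. One thing worth checking is whether de_analysis/coldata.tsv really lists both conditions. The sketch below reads a tab-separated table and reports the distinct values of its `condition` column. The column names follow the coldata.tsv pasted in another report on this page; `condition_levels` is a hypothetical helper, and that `condition` is the factor at fault is an assumption.

```python
import csv
import io


def condition_levels(coldata_text):
    """Return the sorted distinct values of the 'condition' column of a TSV."""
    reader = csv.DictReader(io.StringIO(coldata_text), delimiter="\t")
    return sorted({row["condition"] for row in reader})


# Made-up table in which every sample ended up in the same condition.
example = (
    "sample\tcondition\ttype\n"
    "C1\tuntreated\tsingle-read\n"
    "C2\tuntreated\tsingle-read\n"
)
levels = condition_levels(example)
if len(levels) < 2:
    print(f"only {len(levels)} condition level(s): {levels}")
```

If only one level shows up, the sample dictionaries in config.yml are the next place to look.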

syntax error: Input and output files have to be specified as strings or lists of strings.

Dear Nanoporetech,

I am having an issue running this. I have altered the config.yaml (pasted below). I read in another issue that full paths were required, so I added these, but I have removed identifying names. (Also, can the pipeline take compressed fq files?) Line 15 is the transcriptome: "/PATH/TO/analysis/GRCh38.primary_assembly.genome.fa" line. I can't see what is wrong with this. Sorry! Can you please help?
Pete

I get the following error:

pipeline-transcriptome-de]$ snakemake --use-conda -j 24 all
SyntaxError:
Input and output files have to be specified as strings or lists of strings.
File "/PATH/analysis/pipeline-transcriptome-de/Snakefile", line 15, in
File "/PATH/analysis/pipeline-transcriptome-de/snakelib/utils.snake", line 15, in

## General pipeline parameters:

# Name of the pipeline:
pipeline: "pipeline-transcriptome-de_phe"
# ABSOLUTE path to directory holding the working directory:
workdir_top: "/PATH/TO/analysis/"
# Results directory:
resdir: "results"
# Repository URL:
repo: "https://github.com/nanoporetech/pipeline-transcriptome-de"

## Pipeline-specific parameters:

# Transcriptome fasta
transcriptome: "/PATH/TO/analysis/GRCh38.primary_assembly.genome.fa"

# Annotation GFF/GTF
annotation: "/PATH/TO/analysis/gencode.v39.annotation.gff3"

# Control samples
control_samples:
    C1: "/PATH/TO/analysis/R1_.fastq.gz"
    C2: "/PATH/TO/analysis/R2_.fastq.gz"
    C3: "/PATH/TO/analysis/R3_.fastq.gz"

# Treated samples
treated_samples:
    IR1: "/PATH/TO/analysis/R4_.fastq.gz"
    IR2: "/PATH/TO/analysis/R5_.fastq.gz"
    IR3: "/PATH/TO/analysis/R6_.fastq.gz"

# Minimap2 indexing options
minimap_index_opts: ""

# Minimap2 mapping options
minimap2_opts: ""

# Maximum secondary alignments
maximum_secondary: 100

# Secondary score ratio (-p for minimap2)
secondary_score_ratio: 1.0

# Salmon library type
salmon_libtype: "U"

# Count filtering options - customize these according to your experimental design:

# Genes expressed in minimum this many samples
min_samps_gene_expr: 3

# Transcripts expressed in minimum this many samples
min_samps_feature_expr: 1

# Minimum gene counts
min_gene_expr: 10

# Minimum transcript counts
min_feature_expr: 3

# Threads
threads: 24

Error in `contrasts<-`(`tmp`, value = contr.funs[1 + isOF[nn]])

Hi!
I get the following error message with error in contrasts:

I'm not sure how to fix this. I only have two biological replicates per condition (it is not possible to have more, unfortunately); could that cause the issue?

Also, I got an error:
AttributeError in line 6 of /tools/pipeline-transcriptome-de/Snakefile:
'Workflow' object has no attribute 'overwrite_configfile'
but this was solved by replacing "overwrite_configfile" with "overwrite_configfiles" in the Snakefile:
if not workflow.overwrite_configfiles:
    configfile: "config.yml"

Thanks a lot for help!

mode(counts) %in% "numeric" is not TRUE issue when commenting strip_version does not work

Hi, I've run into the same issue, and commenting out strip_version has not helped. I'm using an NCBI genome CDS file and annotation GFF. Neither the GFF nor the GTF seems to work, even after using grep -v to remove the genes it has specific problems with.

"Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID, :
some transcripts have no "transcript_id" attribute ==> their name
("tx_name" column in the TxDb object) was set to NA
2: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID, :
the transcript names ("tx_name" column in the TxDb object) imported
from the "transcript_id" attribute are not unique
'select()' returned 1:many mapping between keys and columns
Error in dmDSdata(counts = counts, samples = coldata) :
mode(counts) %in% "numeric" is not TRUE
Calls: dmDSdata -> stopifnot
Execution halted
[Thu Sep 9 01:51:02 2021]
Error in rule de_analysis:
jobid: 0
output: de_analysis/results_dge.tsv, de_analysis/results_dge.pdf, de_analysis/results_dtu_gene.tsv, de_analysis/results_dtu_transcript.tsv, de_analysis/results_dtu_stageR.tsv, merged/all_counts_filtered.tsv, merged/all_gene_counts.tsv
shell:

/epi2melabs/differential-expression/pipeline-transcriptome-de/scripts/de_analysis.R

    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /epi2melabs/differential-expression/.snakemake/log/2021-09-09T015044.982826.snakemake.log"

GCF_900626175.1_cs10_genomic.gff.gz

GCF_900626175.1_cs10_cds_from_genomic.fna.gz

Originally posted by @C-Pauli in #18 (comment)

PDF file corrupted

Dear all!

Please help with an error when opening dtu_plots.pdf. The viewer says "Error of the format: is not PDF or file is damaged".

Best regards!

Similar but not identical "Error in dmDSdata(counts = counts, samples = coldata)" compared to #1

Hi Botond,
I'm having an issue which is similar to #1, but the error message is not exactly the same, because my sample names have letter prefixes.
Here are my error messages:

Make the TxDb object ... OK
Warning messages:
1: In .get_cds_IDX(type, phase) :
  The "phase" metadata column contains non-NA values for features of type
  stop_codon, exon. This information was ignored.
2: In .reject_transcripts(bad_tx, because) :
  The following transcripts were rejected because they have stop codons
  that cannot be mapped to an exon: AT1G07320.4, AT1G18180.2,
  AT1G30230.1, AT1G36380.1, AT1G52415.1, AT1G52940.1, AT2G18110.1,
  AT2G35130.1, AT2G35130.3, AT2G39050.1, AT3G13445.1, AT3G17660.1,
  AT3G17660.3, AT3G52070.2, AT3G54680.1, AT3G59450.2, AT4G13730.2,
  AT4G17730.1, AT4G39620.2, AT5G01520.2, AT5G22794.1, AT5G27710.2,
  AT5G45240.2
'select()' returned 1:many mapping between keys and columns
Filtering counts using DRIMSeq.
Error in dmDSdata(counts = counts, samples = coldata) : 
  mode(counts) %in% "numeric" is not TRUE
Calls: dmDSdata -> stopifnot
Execution halted
[Tue Mar  5 16:52:44 2019]
Error in rule de_analysis:
    jobid: 10
    output: de_analysis/results_dge.tsv, de_analysis/results_dge.pdf, de_analysis/results_dtu_gene.tsv, de_analysis/results_dtu_transcript.tsv, de_analysis/results_dtu_stageR.tsv, merged/all_counts_filtered.tsv, merged/all_gene_counts.tsv
    conda-env: /home/weir/output/Nanopore-DGE-Analysis/.snakemake/conda/589d7ca2

RuleException:
CalledProcessError in line 128 of /home/weir/software/pipeline-transcriptome-de/Snakefile:
Command 'source activate '/home/weir/output/Nanopore-DGE-Analysis/.snakemake/conda/589d7ca2'; set -euo pipefail;  /home/weir/software/pipeline-transcriptome-de/scripts/de_analysis.R' returned non-zero exit status 1.
  File "/home/weir/software/pipeline-transcriptome-de/Snakefile", line 128, in __rule_de_analysis
  File "/home/weir/anaconda3/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/weir/software/pipeline-transcriptome-de/.snakemake/log/2019-03-05T165022.384227.snakemake.log

Pasting the content of de_analysis/coldata.tsv

sample  condition       type
Col0_1  untreated       single-read
Col0_2  untreated       single-read
Col0_3  untreated       single-read
fip37_1 treated         single-read
fip37_2 treated         single-read
fip37_3 treated         single-read

Pasting the head of merged/all_counts.tsv

Reference       Col0_1  Col0_2  Col0_3  fip37_1 fip37_2 fip37_3
AT1G67090.2     10935.0 15226.0 22425.0 15200.0 17476.0 16079.0
AT5G38410.3     5733.0  7984.0  12484.0 8362.0  9477.0  4255.0
AT1G29930.1     4225.0  6295.0  22283.0 10630.0 13423.0 24106.0
AT5G38420.1     3222.0  4647.0  1741.0  2569.0  3061.0  1424.0
AT5G38430.2     2826.0  4054.0  1496.0  1671.0  1875.0  1267.0
AT1G20340.1     2382.0  3245.0  5113.0  3062.0  3713.0  2819.0
AT1G79040.1     2155.0  2983.0  6866.0  3192.0  3861.0  4090.0
AT2G39730.3     1782.0  2725.0  3569.0  2003.0  2247.0  1826.0
AT5G54770.1     1770.0  2560.0  1898.0  1643.0  1996.0  1535.0

Pasting the content of de_analysis/de_params.tsv

Annotation      min_samps_gene_expr     min_samps_feature_expr  min_gene_expr   min_feature_expr
/home/weir/output/Nanopore-DGE-Analysis/Araport11_GFF3_genes_transposons.gtf    3       1       10      3

Also, which Salmon library type should be used?
(Another question about "pandas" has been solved, thanks :)
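
The failing check, mode(counts) %in% "numeric", complains that the count table handed to dmDSdata is not purely numeric. One way to look for offending cells in merged/all_counts.tsv before the R step is sketched below. The `Reference` header comes from the table head pasted above; `non_numeric_cells` is a hypothetical helper, and whether stray non-numeric cells are the actual cause in this report is not established.

```python
import csv
import io


def non_numeric_cells(counts_text, id_column="Reference"):
    """Return (row_id, column, value) for every count cell that is not a number."""
    reader = csv.DictReader(io.StringIO(counts_text), delimiter="\t")
    bad = []
    for row in reader:
        for column, value in row.items():
            if column == id_column:
                continue
            try:
                float(value)
            except (TypeError, ValueError):
                # Missing fields arrive as None, text such as "NA" fails to parse.
                bad.append((row[id_column], column, value))
    return bad
```

Called on the text of the count table, an empty result means every sample column parses as a number, which would point the search elsewhere, for example at how transcript identifiers are matched against the annotation.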

Snakefile AttributeError

I got this error while starting to run this pipeline, any ideas how to fix?

AttributeError in line 6 of /home/data/Megan/nanopore/pipeline-transcriptome-de/Snakefile:
'Workflow' object has no attribute 'overwrite_configfiles'
File "/home/data/Megan/nanopore/pipeline-transcriptome-de/Snakefile", line 6, in
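
Two reports on this page hit opposite AttributeErrors: one for a missing `overwrite_configfile`, fixed by switching to `overwrite_configfiles`, and this one for a missing `overwrite_configfiles`. That pattern suggests the attribute name depends on the installed Snakemake version. A version-tolerant check could look like the sketch below. `config_was_overridden` is a hypothetical helper, and treating a truthy attribute as "a config file was supplied" is an assumption based on the `if not workflow.overwrite_configfiles:` snippet quoted in the other report.

```python
from types import SimpleNamespace


def config_was_overridden(workflow):
    """True if either attribute name is present on the object and truthy."""
    for attr in ("overwrite_configfiles", "overwrite_configfile"):
        # getattr with a default avoids the AttributeError seen in the reports.
        if getattr(workflow, attr, None):
            return True
    return False


# Stand-in objects for illustration only; a Snakefile would pass `workflow`.
old_style = SimpleNamespace(overwrite_configfile="my.yml")
new_style = SimpleNamespace(overwrite_configfiles=[])
```

In a Snakefile the result would decide whether to fall back to the default `configfile: "config.yml"`.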

Error in pipeline

config.txt

Hello,

I am running this pipeline with my own data produced using direct cDNA sequencing, it runs fine until step 13 but then I get the error pasted below.
The annotation I am using is in gtf format but is not from ensembl as it is for a non-model organism. The transcriptome was generated from the gtf file using gffread.

Any suggestions to get past this issue would be appreciated

Thank you!

Activating conda environment: /Nanopore/Differential_isoform_analysis/pipeline-transcriptome-de-trials/pipeline-transcriptome-de/Workspaces/pipeline-transcripto$
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
 The "phase" metadata column contains non-NA values for features of type
 stop_codon. This information was ignored.
'select()' returned 1:many mapping between keys and columns
Error in dmDSdata(counts = counts, samples = coldata) :
 mode(counts) %in% "numeric" is not TRUE
Calls: dmDSdata -> stopifnot
Execution halted
[Thu Aug 13 18:47:18 2020]
Error in rule de_analysis:
  jobid: 9
  output: de_analysis/results_dge.tsv, de_analysis/results_dge.pdf, de_analysis/results_dtu_gene.tsv, de_analysis/results_dtu_transcript.tsv, de_analysis/results_dtu_stageR.tsv, merged/all_counts_filtered.tsv, merged/all_gene_cou$
  conda-env: /rds-d4/project/cj107/rds-cj107-jiggins-rds/projects/eratoCortexMapping/Nanopore/Differential_isoform_analysis/pipeline-transcriptome-de-trials/pipeline-transcriptome-de/Workspaces/pipeline-transcriptome-de_phe/.snak$
  shell:
  /Nanopore/Differential_isoform_analysis/pipeline-transcriptome-de-trials/pipeline-transcriptome-de/scripts/de_analysis.R
    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

"Nothing to be done" when running locally

Following the readme file, I have put the fastq files, a .fa of the reference transcriptome, and a .gtf corresponding to the transcriptome in the data folder, and edited the config file appropriately.

When first running the pipeline, it installed all of the dependencies without a problem. Now, it gives the following message:

Building DAG of jobs...
Nothing to be done.

I cannot identify anything else I have to provide to make it work. Am I setting up the config.yaml incorrectly?

plot_dtu_res error

All good til the last moment...

Does anyone have any experience as to what is throwing this error?

Config file looks fine...

Error in rule plot_dtu_res:
jobid: 26
output: de_analysis/dtu_plots.pdf
shell:

home/pipeline-transcriptome-de/scripts/plot_dtu_results.R

    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

Problems with Snakefile. WorkflowError in line 7.

Hi all,

Having the following issue with the workflow:

tardig@DESKTOP-NEKU62R:~/pipeline-transcriptome-de$ snakemake --use-conda -j 4 all
WorkflowError in line 7 of /home/tardig/pipeline-transcriptome-de/Snakefile:
Config file is not valid JSON or YAML. In case of YAML, make sure to not mix whitespace and tab indentation.
File "/home/tardig/pipeline-transcriptome-de/Snakefile", line 7, in

Please find snakefile and config files attached.

config.txt

Snakefile.txt

Thanks,

Stan
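
The error message itself points at mixed tab and space indentation. A quick way to locate such lines in config.yml is sketched below. `lines_with_tab_indent` is a hypothetical helper, and tabs are only one possible reason for a config file failing to parse.

```python
def lines_with_tab_indent(text):
    """Return 1-based line numbers whose leading whitespace contains a tab."""
    flagged = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Everything before the first non-whitespace character is the indent.
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            flagged.append(number)
    return flagged
```

Replacing the tabs on the reported lines with spaces, matching the indentation of the neighbouring entries, is the usual remedy.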

conda version and snakemake incompatible?

Dear all,
I tried to configure a new conda environment for this pipeline, as follows:
My conda version is 4.7.12.
I then ran:

conda env create --name ONT_transcriptome --file env.yml
conda activate ONT_transcriptome
conda install -c bioconda snakemake

The last command returned:

The following specifications were found to be incompatible with each other:
Package certifi conflicts for:
snakemake -> requests -> certifi[version='>=2017.4.17']
python=3.7 -> pip -> setuptools -> certifi[version='>=2016.09']
Package setuptools conflicts for:
python=3.7 -> pip -> setuptools
snakemake -> dropbox[version='>=5.2'] -> setuptools
Package python-dateutil conflicts for:
snakemake -> moto[version='>=0.4.14'] -> botocore[version='<1.11'] -> python-dateutil[version='>=2.1,<2.7.0|>=2.1,<3.0.0']
Package ca-certificates conflicts for:
python=3.7 -> openssl[version='>=1.1.1c,<1.1.2a'] -> ca-certificates
snakemake -> python=3.5 -> openssl=1.0 -> ca-certificates
Package wheel conflicts for:
snakemake -> python=3.5 -> pip -> wheel
python=3.7 -> pip -> wheel
Package pip conflicts for:
python=3.7 -> pip
snakemake -> python=3.5 -> pip

I now returned to another conda env that has:

python                    3.6.7             hd21baee_1002    conda-forge
snakemake                 5.4.2                         0    bioconda
pandas                    0.24.1           py36hf484d3e_0    conda-forge

Is that ok? Did you experience similar incompatibilities with conda envs since earlier this year?
Many thanks for your comments, best regards, Sophia

SyntaxError: Input and output files have to be specified as strings or lists of strings

Hello,

I am trying to run the "pipeline-transcriptome-de_phe" that was downloaded from the GitHub repository (https://github.com/nanoporetech/pipeline-transcriptome-de). I am running this within a Python 3 environment. The config.yml file was modified following the GitHub example. The pipeline runs briefly until it gives me the error message below, with no output files.

Error Message

"SyntaxError: Input and output files have to be specified as strings or lists of strings.

File "/work/lab/pipeline-transcriptome-de-2/Snakefile", line 15, in

File "/work/lab/pipeline-transcriptome-de-2/snakelib/utils.snake", line 15, in "

Any suggestions will be appreciated.

Error in dmDSdata(counts = counts, samples = coldata) :

Hi Botond,

I'm having an issue when I try analyzing some data.

With the .gff3:

Activating conda environment: /home/user/pipeline-transcriptome-de/Workspaces/mosquito/.snakemake/conda/81c03226
Loading counts, conditions and parameters.
Loading annotation database.
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .extract_exons_from_GRanges(exon_IDX, gr, ID, Name, Parent, feature = "exon",  :
  The following orphan exon were dropped (showing only the 6 first):
  seqid  start    end strand   ID        Parent             Name
1     1  28171  28393      - <NA> AAEL020532-RA AAEL020532-RA-E2
2     1  28468  28857      - <NA> AAEL020532-RA AAEL020532-RA-E1
3     1  29507  29645      - <NA> AAEL027741-RA AAEL027741-RA-E2
4     1  29726  30127      - <NA> AAEL027741-RA AAEL027741-RA-E1
5     1 129789 129946      + <NA> AAEL021681-RA AAEL021681-RA-E1
6     1 142976 143057      + <NA> AAEL021681-RA AAEL021681-RA-E2
'select()' returned 1:many mapping between keys and columns
Filtering counts using DRIMSeq.
Error in dmDSdata(counts = counts, samples = coldata) : 
  all(samples$sample_id %in% colnames(counts)) is not TRUE
Calls: dmDSdata -> stopifnot
Execution halted
[Mon Feb 11 10:56:24 2019]
Error in rule de_analysis:
    jobid: 16
    output: de_analysis/results_dge.tsv, de_analysis/results_dge.pdf, de_analysis/results_dtu_gene.tsv, de_analysis/results_dtu_transcript.tsv, de_analysis/results_dtu_stageR.tsv, merged/all_counts_filtered.tsv, merged/all_gene_counts.tsv
    conda-env: /home/user/pipeline-transcriptome-de/Workspaces/mosquito/.snakemake/conda/81c03226

RuleException:
CalledProcessError in line 128 of /home/user/pipeline-transcriptome-de/Snakefile:
Command 'source activate /home/user/pipeline-transcriptome-de/Workspaces/mosquito/.snakemake/conda/81c03226; set -euo pipefail;  
    /home/user/pipeline-transcriptome-de/scripts/de_analysis.R ' returned non-zero exit status 1.
  File "/home/user/pipeline-transcriptome-de/Snakefile", line 128, in __rule_de_analysis
  File "/home/user/miniconda3/envs/pipetrans/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/user/pipeline-transcriptome-de/.snakemake/log/2019-02-11T102246.103956.snakemake.log

With the .gtf:

Loading counts, conditions and parameters.
Loading annotation database.
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .get_cds_IDX(type, phase) :
  The "phase" metadata column contains non-NA values for features of type
  stop_codon. This information was ignored.
'select()' returned 1:many mapping between keys and columns
Filtering counts using DRIMSeq.
Error in dmDSdata(counts = counts, samples = coldata) : 
  all(samples$sample_id %in% colnames(counts)) is not TRUE
Calls: dmDSdata -> stopifnot
Execution halted
[Mon Feb 11 11:20:36 2019]
Error in rule de_analysis:
    jobid: 16
    output: de_analysis/results_dge.tsv, de_analysis/results_dge.pdf, de_analysis/results_dtu_gene.tsv, de_analysis/results_dtu_transcript.tsv, de_analysis/results_dtu_stageR.tsv, merged/all_counts_filtered.tsv, merged/all_gene_counts.tsv
    conda-env: /home/user/pipeline-transcriptome-de/Workspaces/mosquito/.snakemake/conda/81c03226

RuleException:
CalledProcessError in line 128 of /home/user/pipeline-transcriptome-de/Snakefile:
Command 'source activate /home/user/pipeline-transcriptome-de/Workspaces/mosquito/.snakemake/conda/81c03226; set -euo pipefail;  
    /home/user/pipeline-transcriptome-de/scripts/de_analysis.R ' returned non-zero exit status 1.
  File "/home/user/pipeline-transcriptome-de/Snakefile", line 128, in __rule_de_analysis
  File "/home/user/miniconda3/envs/pipetrans/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/user/pipeline-transcriptome-de/.snakemake/log/2019-02-11T110011.038209.snakemake.log

The transcripts and annotations were downloaded from:
https://www.vectorbase.org/downloads?field_organism_taxonomy_tid%5B%5D=372&field_status_value=Current

I used:
Aedes-aegypti-LVP_AGWG_TRANSCRIPTS_AaegL5.1.fa
Aedes-aegypti-LVP_AGWG_BASEFEATURES_AaegL5.1.gff3
Aedes-aegypti-LVP_AGWG_BASEFEATURES_AaegL5.1.gtf

I was running a test with 12 barcoded samples on a couple flow cells so the reads range from 120k to 500k each.

Cheers
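
The failing assertion, all(samples$sample_id %in% colnames(counts)), says that at least one sample listed in the condition table has no matching column in the count table. The sketch below compares the two directly. `missing_samples` is a hypothetical helper, and the file layouts are assumed to match the coldata.tsv and all_counts.tsv excerpts pasted in another report on this page. One possible cause, offered as an assumption only: R's default table reading prepends "X" to column names that start with a digit, so purely numeric sample names would no longer match.

```python
def missing_samples(coldata_lines, counts_header):
    """Sample names listed in coldata but absent from the counts header."""
    # First header field is the transcript ID column; the rest are samples.
    count_columns = set(counts_header.rstrip("\n").split("\t")[1:])
    # Skip the coldata header line and any blank lines.
    samples = [line.split("\t")[0] for line in coldata_lines[1:] if line.strip()]
    return [s for s in samples if s not in count_columns]
```

An empty result on the files as written to disk, combined with a failure inside R, would be consistent with the names being altered on import.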

salmon output --noLengthCorrection

COMBINE-lab/salmon#602 (comment)

Having read this, I was wondering whether I should be using --noLengthCorrection, which would make the TPM more meaningful for Nanopore reads, and perhaps the NumReads too, although I am unsure what NumReads actually represents.

This pipeline uses NumReads for downstream analysis, but I was wondering whether NumReads also models length bias, as TPM does in salmon. If so, would --noLengthCorrection be necessary?
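To make the question concrete: as far as I understand the quant.sf columns, NumReads is the estimated number of reads assigned to each transcript, and TPM is derived from it by dividing by EffectiveLength and rescaling so the values sum to one million. The sketch below only illustrates that arithmetic. It is not salmon's code, and how salmon applies --noLengthCorrection internally is an assumption modelled here by giving every transcript the same effective length.

```python
def tpm(num_reads, eff_lengths):
    """Transcripts per million: reads per base, rescaled to sum to 1e6."""
    rates = [n / length for n, length in zip(num_reads, eff_lengths)]
    total = sum(rates)
    return [rate / total * 1e6 for rate in rates]


# Two hypothetical transcripts with the same number of reads.
reads = [100, 100]

# With length correction the shorter transcript gets the higher TPM.
print(tpm(reads, [1000, 2000]))

# With a constant effective length, TPM is simply proportional to NumReads.
print(tpm(reads, [1, 1]))
```

The length division only enters in the TPM step, so the NumReads column used downstream is not a length-normalised quantity in this picture.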

counts are not gene

The quant.sf files that are generated by sample (pipeline-transcriptome-de_phe/counts/$samplename_salmon), when I run the pipeline, look like this (mouse genome):

Name    Length  EffectiveLength TPM     NumReads
chr1    195471971       195471722.000   37832.318871    94666.189
chr2    182113224       182112975.000   44177.212938    102988.135
chr3    160039680       160039431.000   56700.439152    116161.287
chr4    156508116       156507867.000   62252.200494    124720.796
chr5    151834684       151834435.000   21599.692789    41982.263
chr6    149736546       149736297.000   28423.282601    54481.532
chr7    145441459       145441210.000   93232.389098    173581.039
chr8    129401213       129400964.000   53266.152592    88234.176
chr9    124595110       124594861.000   28806.345603    45944.792
chr10   130694993       130694744.000   109556.806254   183292.831
chr11   122082543       122082294.000   86232.333615    134763.044
chr12   120129022       120128773.000   51734.803277    79556.897
chr13   120421639       120421390.000   30402.046775    46865.629
chr14   124902244       124901995.000   9604.455594     15356.425
chr15   104043685       104043436.000   90960.906485    121148.337
chr16   98207768        98207519.000    17191.145805    21612.129
chr17   94987271        94987022.000    28660.455170    34849.403
chr18   90702639        90702390.000    38024.378414    44149.811
chr19   61431566        61431317.000    99570.377773    78301.120
chrX    171031299       171031050.000   11767.443247    25763.512
chrY    91744698        91744449.000    4.813453        5.653

Isn't it supposed to hold the counts for each transcript in the annotated genome?
As a consequence, other errors come up in the steps run in R for DE.
Any help or suggestion is very much appreciated.
Best, Sophia
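One possible reading of this table, stated as an assumption rather than a diagnosis: the names are chromosomes and the lengths are in the hundreds of megabases, which is what you would see if the `transcriptome` entry in config.yml pointed at a genome FASTA instead of a transcript FASTA. A small sketch for flagging such a quant.sf follows; the 1,000,000-base threshold is an arbitrary illustrative cutoff.

```python
def oversized_targets(quant_sf_text, max_length=1_000_000):
    """Return names of quant.sf targets longer than max_length bases.

    Entries this long are unlikely to be transcripts, so a non-empty
    result hints that the reference is a genome.
    """
    names = []
    lines = quant_sf_text.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        name, length = fields[0], int(fields[1])
        if length > max_length:
            names.append(name)
    return names


# Two rows copied from the table above.
example = """Name    Length  EffectiveLength TPM     NumReads
chr1    195471971       195471722.000   37832.318871    94666.189
chrY    91744698        91744449.000    4.813453        5.653
"""

print(oversized_targets(example))
```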

Question about input files

Hi,
I have two questions about differential expression:
1. Do I need to filter out reads with low base quality (some extremely long ONT reads have low base quality)?
2. Do the biological replicates need to have similar sequencing depth?
In other words, I have one biological replicate with a higher sequencing depth than the other two. Do I need to merge the low-depth samples into one?
Thanks!

Error in running de_analysis script

Based on coldata.tsv, seems OK

config.yml.txt

`rule de_analysis:
input: de_analysis/de_params.tsv, de_analysis/coldata.tsv, merged/all_counts.tsv
output: de_analysis/results_dge.tsv, de_analysis/results_dge.pdf, de_analysis/results_dtu_gene.tsv, de_analysis/results_dtu_transcript.tsv, de_analysis/results_dtu_stageR.tsv, merged/all_counts_filtered.tsv, merged/all_gene_counts.tsv
jobid: 3

Loading counts, conditions and parameters.
Loading annotation database.
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
The "phase" metadata column contains non-NA values for features of type
stop_codon. This information was ignored.
'select()' returned 1:many mapping between keys and columns
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 3, 5, 4, 7, 6
Calls: strip_version ... as.data.frame -> as.data.frame.list -> do.call ->
Execution halted
Error in job de_analysis while creating output files de_analysis/results_dge.tsv, de_analysis/results_dge.pdf, de_analysis/results_dtu_gene.tsv, de_analysis/results_dtu_transcript.tsv, de_analysis/results_dtu_stageR.tsv, merged/all_counts_filtered.tsv, merged/all_gene_counts.tsv.
RuleException:
CalledProcessError in line 137 of /home/callum/pipeline-transcriptome-de/Snakefile:
Command '
/home/callum/pipeline-transcriptome-de/scripts/de_analysis.R
' returned non-zero exit status 1.
File "/home/callum/pipeline-transcriptome-de/Snakefile", line 137, in __rule_de_analysis
File "/home/callum/miniconda3/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Will exit after finishing currently running jobs.
Exiting because a job execution failed. Look above for error message
`

Rule exception when running pipeline-transcriptome-de/scripts/R

Hey @bsipos, sorry about posting in the wrong repository yesterday. I still had problems writing out such a large xlsx file, so instead I switched to running my data through the pipeline without the context of the tutorial Rmarkdown.

But I ran into an issue with the R scripts at the end.

`Error in rule de_analysis:
jobid: 10
output: de_analysis/results_dge.tsv, de_analysis/results_dge.pdf, de_analysis/results_dtu_gene.tsv, de_analysis/results_dtu_transcript.tsv, de_analysis/results_dtu_stageR.tsv, merged/all_counts_filtered.tsv, merged/all_gene_counts.tsv
conda-env: /Users/callum/pipeline-transcriptome-de/Workspaces/pipeline-transcriptome-de_phe/.snakemake/conda/b39bc3b2

RuleException:
CalledProcessError in line 128 of /Users/callum/pipeline-transcriptome-de/Snakefile:
Command 'source activate /Users/callum/pipeline-transcriptome-de/Workspaces/pipeline-transcriptome-de_phe/.snakemake/conda/b39bc3b2; set -euo pipefail;
/Users/callum/pipeline-transcriptome-de/scripts/de_analysis.R ' returned non-zero exit status 1.
File "/Users/callum/pipeline-transcriptome-de/Snakefile", line 128, in __rule_de_analysis
File "/Users/callum/miniconda3/lib/python3.6/concurrent/futures/thread.py", line 56, in run`

Do I need to do anything else other than install via conda snakemake and pandas, then run snakemake?

bioconductor-genomeinfodbdata-1.2.3-r40_0

This error occurred when snakemake was initiated:
`ERROR conda.core.link:_execute(698): An error occurred while installing package 'bioconda::bioconductor-genomeinfodbdata-1.2.3-r40_0'.
Rolling back transaction: ...working... done

==> script messages <==

==> script output <==
stdout: ERROR: post-link.sh was unable to download any of the following URLs with the md5sum 720784da6bddbd4e18ab0bccef6b0a95:
https://bioconductor.org/packages/3.11/data/annotation/src/contrib/GenomeInfoDbData_1.2.3.tar.gz
https://bioarchive.galaxyproject.org/GenomeInfoDbData_1.2.3.tar.gz
https://depot.galaxyproject.org/software/bioconductor-genomeinfodbdata/bioconductor-genomeinfodbdata_1.2.3_src_all.tar.gz`

The pipeline was abruptly terminated in the first step

Hi
When I run snakemake, the program terminates after completing the first step, but no errors are raised.

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 8
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       dump_versions
        1

[Sat Jun  8 20:58:14 2019]
rule dump_versions:
    output: versions.txt
    jobid: 0

[Sat Jun  8 20:58:29 2019]
Finished job 0.
1 of 1 steps (100%) done

Then I tried to run the second step manually (snakemake -j 8 -f build_minimap_index), and the minimap index was built correctly without error.

Error in dmDSdata

Hi,

We are using the Nanopore pipeline (Pipeline for differential gene expression (DGE) and differential transcript usage (DTU) analysis using long reads) and we got stuck at the DRIMSeq step on our own data (using the Arabidopsis genome annotation file TAIR10_GFF3_genes.gff and TAIR10_cdna_20101212_updated.fna). I think the problem comes from the de_analysis.R script: txdf$TXNAME keeps the version numbers, whereas in rownames(cts) the version numbers have been removed. Therefore, nothing matches in the input to dmDSdata(). Maybe you could look into it further.

Thanks,

Warning message:
package ‘DRIMSeq’ was built under R version 4.0.3
Warning messages:
1: package ‘GenomicFeatures’ was built under R version 4.0.3
2: package ‘S4Vectors’ was built under R version 4.0.3
3: package ‘IRanges’ was built under R version 4.0.3
4: package ‘GenomeInfoDb’ was built under R version 4.0.3
5: package ‘GenomicRanges’ was built under R version 4.0.3
6: package ‘AnnotationDbi’ was built under R version 4.0.3
7: package ‘Biobase’ was built under R version 4.0.3
Loading counts, conditions and parameters.
Loading annotation database.
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: In .extract_exons_from_GRanges(exon_IDX, gr, mcols0, tx_IDX, feature = "exon", :
6277 exons couldn't be linked to a transcript so were dropped (showing
only the first 6):
seqid start end strand ID Name Parent Parent_type
1 Chr1 433031 433819 - AT1G02228.1
2 Chr1 846664 847739 + AT1G03420.1
3 Chr1 2070737 2070893 + AT1G06740.1
4 Chr1 2071102 2073535 + AT1G06740.1
5 Chr1 2415041 2415970 + AT1G07800.1
6 Chr1 2531695 2534786 - AT1G08105.1
2: In .extract_exons_from_GRanges(cds_IDX, gr, mcols0, tx_IDX, feature = "cds", :
197160 CDS couldn't be linked to a transcript so were dropped (showing
only the first 6):
seqid start end strand ID Name Parent Parent_type
1 Chr1 3760 3913 + AT1G01010.1-Protein
2 Chr1 3996 4276 + AT1G01010.1-Protein
3 Chr1 4486 4605 + AT1G01010.1-Protein
4 Chr1 4706 5095 + AT1G01010.1-Protein
5 Chr1 5174 5326 + AT1G01010.1-Protein
6 Chr1 5439 5630 + AT1G01010.1-Protein
'select()' returned 1:many mapping between keys and columns
Filtering counts using DRIMSeq.
Error in dmDSdata(counts = counts, samples = coldata) :
mode(counts) %in% "numeric" is not TRUE
Calls: dmDSdata -> stopifnot
Execution halted
[Tue Mar 9 15:27:59 2021]
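The hypothesis in this report (identifiers with a numeric suffix on one side, stripped on the other) can be illustrated with a short sketch. Note that the log itself stops on a different check, `mode(counts) %in% "numeric"`, so this only illustrates the poster's suggestion and is not a confirmed cause. The helper names and the three IDs are illustrative; this is not the `strip_version` function from de_analysis.R.

```python
import re


def strip_suffix(tx_id):
    """Drop a trailing '.<digits>' suffix, e.g. 'AT1G01010.1' -> 'AT1G01010'."""
    return re.sub(r"\.\d+$", "", tx_id)


def shared_ids(annotation_ids, count_ids):
    """Return the IDs present in both collections, sorted."""
    return sorted(set(annotation_ids) & set(count_ids))


# Illustrative TAIR-style transcript IDs.
annotation_ids = ["AT1G01010.1", "AT1G01020.1", "AT1G01020.2"]
count_ids = [strip_suffix(t) for t in annotation_ids]

# One side stripped, the other not: nothing matches.
print(shared_ids(annotation_ids, count_ids))

# Stripping both sides restores the match, but merges the two AT1G01020 isoforms.
print(shared_ids([strip_suffix(t) for t in annotation_ids], count_ids))
```

In TAIR identifiers the numeric suffix names the transcript model rather than a version, so removing it collapses isoforms onto the gene ID, which would matter for a transcript-usage analysis.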

Error in rule de_analysis: jobid: 9

No matter how hard I try I can't seem to get over this error

Loading counts, conditions and parameters.
Loading annotation database.
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of type
  stop_codon. This information was ignored.
'select()' returned 1:many mapping between keys and columns
Error in rule de_analysis:
    jobid: 9
    output: de_analysis/results_dge.tsv, de_analysis/results_dge.pdf, de_analysis/results_dtu_gene.tsv, de_analysis/results_dtu_transcript.tsv, de_analysis/results_dtu_stageR.tsv, merged/all_counts_filtered.tsv, merged/all_gene_counts.tsv
    conda-env: /home/mustafa/pipeline-transcriptome-de/Workspaces/ExperimentPipelines/.snakemake/conda/fe062354
    shell:

    /home/mustafa/pipeline-transcriptome-de/scripts/de_analysis.R

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/mustafa/pipeline-transcriptome-de/.snakemake/log/2020-08-26T192935.491547.snakemake.log

Error with alignment to reference in gene_expression tutorial

I'm running the gene_expression tutorial on a MacBook Air with the Apple chip.

I get the following error:

Error in rule build_minimap_index:
jobid: 0
output: index/transcriptome_index.mmi
shell:

    minimap2 -t 4  -I 1000G -d index/transcriptome_index.mmi /epi2melabs/differential-expression/sample_data/Homo_sapiens.GRCh38.cdna.all.fa.gz

    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Is there a solution to this? I have just run the tutorial straightforwardly without any tinkering so it looks like a scripting/parameterization issue

Full log:

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
Conda environments: ignored
Job counts:
count jobs
1 build_minimap_index
1

[Thu Aug 5 10:42:13 2021]
rule build_minimap_index:
input: /epi2melabs/differential-expression/sample_data/Homo_sapiens.GRCh38.cdna.all.fa.gz
output: index/transcriptome_index.mmi
jobid: 0
threads: 4

[Thu Aug 5 10:42:32 2021]
Error in rule build_minimap_index:
jobid: 0
output: index/transcriptome_index.mmi
shell:

    minimap2 -t 4  -I 1000G -d index/transcriptome_index.mmi /epi2melabs/differential-expression/sample_data/Homo_sapiens.GRCh38.cdna.all.fa.gz

    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job build_minimap_index since they might be corrupted:
index/transcriptome_index.mmi
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /epi2melabs/differential-expression/.snakemake/log/2021-08-05T104211.761394.snakemake.log

Salmon version

Which version of Salmon is suitable for this pipeline?
I have version 0.7.2 and it doesn't support --noErrorModel.
