vivekbhr / subread_to_dexseq Goto Github PK

View Code? Open in Web Editor NEW

29.0 6.0 19.0 959 KB

Scripts to import your FeatureCounts output into DEXSeq

License: GNU General Public License v3.0

R 37.39% Python 62.61%

rna-seq featurecounts dexseq

subread_to_dexseq's Introduction

Subread_to_DEXSeq

Vivek Bhardwaj

These functions provide a way to use featurecounts output for DEXSeq

The directory contains two scripts:

dexseq_prepare_annotation2.py : It's same as the "dexseq_prepare_annotation.py" that comes with DEXSeq, but with an added option to output featureCounts-readable GTF file.
load_SubreadOutput.R : Provides a function "DEXSeqDataSetFromFeatureCounts", to load the output of featureCounts as a dexSeq dataset (dxd) object.

Usage example

1) Prepare annotation

Syntax :

python dexseq_prepare_annotation2.py -f <featurecounts.gtf> <input.gtf> <dexseq_counts.gff>

Example :

python dexseq_prepare_annotation2.py -f dm6_ens76_flat.gtf dm6_ens76.gtf dm6_ens76_flat.gff

you will get a file "dm6_ens76_flat.gff" and another "dm6_ens76_flat.gtf" (for featurecounts)

2) Count using Subread (command line)

We use the -f options to count reads overlapping features.

We can use the -O option to count the reads overlapping to multiple exons (similar to DEXSeq_count).

/path/to/subread/bin/featureCounts -f -O -s 2 -p -T 40 \
-F GTF -a dm6_ens76_flat.gtf \
-o dm6_fCount.out Cont_1.bam Cont_2.bam Test_1.bam Test_2.bam

3) load into DEXSeq**

This script requires dplyr, and DEXSeq installed in your R..

In R prepare a sampleData data.frame, which contains sample names used for featurecounts as rownames, plus condition, and other variables you want to use for DEXSeq design matrix.

Example :

source("load_SubreadOutput.R")
samp <- data.frame(row.names = c("cont_1","cont_2","test_1","test_2"), 
                        condition = rep(c("control","trt"),each=2))
dxd.fc <- DEXSeqDataSetFromFeatureCounts("dm6_fCount.out",
                                         flattenedfile = "dm6_ens76_flat.gtf",sampleData = samp)

This will create a dxd object that you can use for DEXSeq analysis.

Results

On a real dataset from drosophila (mapped to dm6). I compared the output from featurecounts (two modes) and DEXSeq_Counts.

In unique mode, fragments overlapping multiple features are not counted, while in multi mode, they are counted.

Dispersion Estimates

Results

Number of differentially expressed exons with 10% FDR. The output from featurecounts is highly similar to DEXSeq_Count, when we count the multi-feature overlapping reads (-O option).

subread_to_dexseq's People

Contributors

Stargazers

Watchers

Forkers

davymoon cwt1 abrar-alshaer anitalakatos kaji331 jvrakor yangj9932 smyang2018 masaver silviamorins ismail-curie unique379r louiseadore0808 geng-lee amhaslam hassanfa xia1yang qhuitan finitenet

subread_to_dexseq's Issues

geneID/featureID truncated

I saw a warning that one geneID got truncated, and in the R codes it said this is needed to match
featureCounts. Could this truncate step be skipped now?

long names of gene sets in the flattened gtf stop the R script

Hey Vivek,
Thanks for publishing this nice hack for combining the two useful genomic tools!

I have tried it with hg38 from Encembl v86. The flattened annotation file happens to have one set having the complete name 351 chars long - it is PCDHG "cascade of genes". This gets truncated to 256 chars by featureCounts and causes your script to stop due not matching names.

I have deleted this set from the gtf together with its 99 exons (as likely misleading for splicing anyway) and did the counting with such gtf, then everything worked OK. But perhaps there can be more polite solution your own way, such as truncating all names(exoninfo) to 256 characters etc....

Cheers,
Michal

Subread_to_DEXseq problem

Hi,

I'm trying to use your program in order to use featureCount as input for DEXseq, but I'm facing some troubles with it.

I'm working with Homo_sapiens.GRCh38.91.chr.gtf release from ensembl

I followed the tutorial, as I am on mac, I modify the dexseq_prepare_annotation2.py, by adding a .bak after 'sed -i.....' -> 'sed -i.bak....' at the end of the script to do not have error during the GTF preparation.
So I used the following command :

And after that start to use featureCounts-1.6

As you can see, featureCounts doesn't support the GTF file that generate the script. Do you have any idea what's going wrong with this procedure ?

Thanks for the help, and the tutorial
Nicolas

Something wrong with featureCounts

Hi,
firstly, thanks for your wonderful scripts.

I'm trying to use dexseq_prepare_annotation2.py to prepare the needed GTF for featureCounts and DEXSeq, my primary GTF file is Homo_sapiens.GRCh37.75.gtf which downloaded from ensembl.Then, This step is running successfully whithout any error.

However, when I ran featureCounts, something wrong happened, WARNING was reported as follow:

I wonder if this mistake gets in the way? Will it affect the subsequent analysis? Since my subsequent analysis also reported errors, I don't know what went wrong.

dxd.fc <- DEXSeqDataSetFromFeatureCounts("~/dexseq/test_out3.txt",flattenedfile = "~/annotation/GRCh37_feature.gtf",sampleData = samp)

Reading and adding Exon IDs for DEXSeq
Error in DEXSeqDataSetFromFeatureCounts("~/dexseq/test_out3.txt", flattenedfile = "~/annotation/GRCh37_feature.gtf", :
Count files do not correspond to the flattened annotation file
此外: Warning message:
1 aggregate geneIDs were found truncated in featureCounts outpu

Thanks
Jessian

problem for DEXseq

"-t" argument, exonic_part

Though I saw a closed issue said the exonic_part should be added, I found the dexseq_prepare_annotation2.py does NOT work with that argument. There is no exonic_part
in the dm6_ens76_flat.gtf at all, it only appears in the dm6_ens76_flat.gff

Am I right that the "-t" parameter should be deleted?

featureCounts parameters

Hi Vivek.

Thanks for providing the Subread_to_DEXSeq scripts.

I believe there are two typos in the featureCounts example you give
/path/to/subread-1.4.6-p2/bin/featureCounts -f -O -s 2 -p -T 40 -F GTF -a dm6_ens76_flat.gtf -o dm6_fCount.out Cont_1.bam Cont_2.bam Test_1.bam Test_2.bam

You use -s 2 but this would count reads if they are located on the opposite strand of the annotated feature. Are you sure this is correct?
Additionally, if I use your dexseq_prepare_annotation2.py to prepare the flattened GTF file, I need to add -t exonic_part to the featureCounts arguments to perform counting at the exon level.

Cheers,
Maurits

Error in source("load_SubreadOutput.R")

Hi im new in this and i got yhis error, please help me

Error in source("load_SubreadOutput.R") :
load_SubreadOutput.R:1:40: unexpected ','
1: {"payload":{"allShortcutsEnabled":false,

FeatureCount parameters

Why is the Successfully assigned alignments so low 37.3% when i use featureCounts -f -O -p -s 2 -T 16 -F GTF -a ../vm25self.DEXSeq.gtf -o ../vm25selfallDEXSeqcount.txt test1.sort.bam test2.sort.bam
what is more, my data is paired-end, but commandline display The reads are assigned on the single-end mode

Process BAM file test1.sort.bam... ||
|| Strand specific : reversely stranded ||
|| Paired-end reads are included. ||
|| The reads are assigned on the single-end mode. ||
|| Total alignments : 106372557 ||
|| Successfully assigned alignments : 39673738 (37.3%) ||
|| Running time : 0.18 minutes

Thanks a lot

GTF to GFF transformation

I downloaded the Homo_sapiens.GRCh38.102.gtf from Ensembl, but I am getting this error for the GTF to GFF transformation.

python /opt/anaconda3/lib/R/library/DEXSeq/python_scripts/dexseq_prepare_annotation.py Homo_sapiens.GRCh38.102.gtf dexseq.gff

File "/opt/anaconda3/lib/R/library/DEXSeq/python_scripts/dexseq_prepare_annotation.py", line 129
raise ValueError, "Same name found on two chromosomes: %s, %s" % ( str(l[i]), str(l[i+1]) )

^
SyntaxError: invalid syntax

so how can I to fix it?
thank you!

Error in DEXSeqDataSetFromFeatureCounts

Hi,

I'm trying to use your code to run DEXSeq using featureCounts output, but when I run

dxd.fc <- DEXSeqDataSetFromFeatureCounts("counts_fCount.out", flattenedfile = "featurecountsgtf",sampleData = samp)

I get the following error message:

Error in DESeqDataSet(rse, design, ignoreRank = TRUE) :
some values in assay are not integers
In addition: Warning message:
Error in DESeqDataSet(rse, design, ignoreRank = TRUE) :
some values in assay are not integers

Could you please help me out with this as I'm not sure what the problem is.

Thanks,
Sneha

vivekbhr / subread_to_dexseq Goto Github PK

subread_to_dexseq's Introduction

Subread_to_DEXSeq

These functions provide a way to use featurecounts output for DEXSeq

Usage example

Results

Dispersion Estimates

Results

subread_to_dexseq's People

Contributors

Stargazers

Watchers

Forkers

subread_to_dexseq's Issues

Recommend Projects

Recommend Topics

Recommend Org