Giter Site home page Giter Site logo

subread_to_dexseq's Introduction

Subread_to_DEXSeq

Vivek Bhardwaj

These functions provide a way to use featurecounts output for DEXSeq

The directory contains two scripts:

  1. dexseq_prepare_annotation2.py : It's same as the "dexseq_prepare_annotation.py" that comes with DEXSeq, but with an added option to output featureCounts-readable GTF file.

  2. load_SubreadOutput.R : Provides a function "DEXSeqDataSetFromFeatureCounts", to load the output of featureCounts as a dexSeq dataset (dxd) object.

Usage example

1) Prepare annotation

Syntax :

python dexseq_prepare_annotation2.py -f <featurecounts.gtf> <input.gtf> <dexseq_counts.gff>

Example :

python dexseq_prepare_annotation2.py -f dm6_ens76_flat.gtf dm6_ens76.gtf dm6_ens76_flat.gff

you will get a file "dm6_ens76_flat.gff" and another "dm6_ens76_flat.gtf" (for featurecounts)

2) Count using Subread (command line)

We use the -f options to count reads overlapping features.

We can use the -O option to count the reads overlapping to multiple exons (similar to DEXSeq_count).

/path/to/subread/bin/featureCounts -f -O -s 2 -p -T 40 \
-F GTF -a dm6_ens76_flat.gtf \
-o dm6_fCount.out Cont_1.bam Cont_2.bam Test_1.bam Test_2.bam

3) load into DEXSeq**

This script requires dplyr, and DEXSeq installed in your R..

In R prepare a sampleData data.frame, which contains sample names used for featurecounts as rownames, plus condition, and other variables you want to use for DEXSeq design matrix.

Example :

source("load_SubreadOutput.R")
samp <- data.frame(row.names = c("cont_1","cont_2","test_1","test_2"), 
                        condition = rep(c("control","trt"),each=2))
dxd.fc <- DEXSeqDataSetFromFeatureCounts("dm6_fCount.out",
                                         flattenedfile = "dm6_ens76_flat.gtf",sampleData = samp)

This will create a dxd object that you can use for DEXSeq analysis.

Results

On a real dataset from drosophila (mapped to dm6). I compared the output from featurecounts (two modes) and DEXSeq_Counts.

In unique mode, fragments overlapping multiple features are not counted, while in multi mode, they are counted.

Dispersion Estimates

Results

Number of differentially expressed exons with 10% FDR. The output from featurecounts is highly similar to DEXSeq_Count, when we count the multi-feature overlapping reads (-O option).

subread_to_dexseq's People

Contributors

jvrakor avatar vivekbhr avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

subread_to_dexseq's Issues

geneID/featureID truncated

I saw a warning that one geneID got truncated, and in the R codes it said this is needed to match
featureCounts. Could this truncate step be skipped now?

long names of gene sets in the flattened gtf stop the R script

Hey Vivek,
Thanks for publishing this nice hack for combining the two useful genomic tools!

I have tried it with hg38 from Encembl v86. The flattened annotation file happens to have one set having the complete name 351 chars long - it is PCDHG "cascade of genes". This gets truncated to 256 chars by featureCounts and causes your script to stop due not matching names.

I have deleted this set from the gtf together with its 99 exons (as likely misleading for splicing anyway) and did the counting with such gtf, then everything worked OK. But perhaps there can be more polite solution your own way, such as truncating all names(exoninfo) to 256 characters etc....

Cheers,
Michal

Subread_to_DEXseq problem

Hi,

I'm trying to use your program in order to use featureCount as input for DEXseq, but I'm facing some troubles with it.

I'm working with Homo_sapiens.GRCh38.91.chr.gtf release from ensembl

capture d ecran 2018-05-23 a 15 08 57

I followed the tutorial, as I am on mac, I modify the dexseq_prepare_annotation2.py, by adding a .bak after 'sed -i.....' -> 'sed -i.bak....' at the end of the script to do not have error during the GTF preparation.
So I used the following command :

capture d ecran 2018-05-23 a 15 19 54

And after that start to use featureCounts-1.6

capture d ecran 2018-05-23 a 15 16 40

As you can see, featureCounts doesn't support the GTF file that generate the script. Do you have any idea what's going wrong with this procedure ?

Thanks for the help, and the tutorial
Nicolas

Something wrong with featureCounts

Hi,
firstly, thanks for your wonderful scripts.

I'm trying to use dexseq_prepare_annotation2.py to prepare the needed GTF for featureCounts and DEXSeq, my primary GTF file is Homo_sapiens.GRCh37.75.gtf which downloaded from ensembl.Then, This step is running successfully whithout any error.

However, when I ran featureCounts, something wrong happened, WARNING was reported as follow:
image

I wonder if this mistake gets in the way? Will it affect the subsequent analysis? Since my subsequent analysis also reported errors, I don't know what went wrong.

dxd.fc <- DEXSeqDataSetFromFeatureCounts("~/dexseq/test_out3.txt",flattenedfile = "~/annotation/GRCh37_feature.gtf",sampleData = samp)

Reading and adding Exon IDs for DEXSeq
Error in DEXSeqDataSetFromFeatureCounts("~/dexseq/test_out3.txt", flattenedfile = "~/annotation/GRCh37_feature.gtf", :
Count files do not correspond to the flattened annotation file
此外: Warning message:
1 aggregate geneIDs were found truncated in featureCounts outpu

Thanks
Jessian

"-t" argument, exonic_part

Though I saw a closed issue said the exonic_part should be added, I found the dexseq_prepare_annotation2.py does NOT work with that argument. There is no exonic_part
in the dm6_ens76_flat.gtf at all, it only appears in the dm6_ens76_flat.gff

Am I right that the "-t" parameter should be deleted?

featureCounts parameters

Hi Vivek.

Thanks for providing the Subread_to_DEXSeq scripts.

I believe there are two typos in the featureCounts example you give
/path/to/subread-1.4.6-p2/bin/featureCounts -f -O -s 2 -p -T 40 -F GTF -a dm6_ens76_flat.gtf -o dm6_fCount.out Cont_1.bam Cont_2.bam Test_1.bam Test_2.bam

  1. You use -s 2 but this would count reads if they are located on the opposite strand of the annotated feature. Are you sure this is correct?
  2. Additionally, if I use your dexseq_prepare_annotation2.py to prepare the flattened GTF file, I need to add -t exonic_part to the featureCounts arguments to perform counting at the exon level.

Cheers,
Maurits

Error in source("load_SubreadOutput.R")

Hi im new in this and i got yhis error, please help me

Error in source("load_SubreadOutput.R") :
load_SubreadOutput.R:1:40: unexpected ','
1: {"payload":{"allShortcutsEnabled":false,

FeatureCount parameters

Why is the Successfully assigned alignments so low 37.3% when i use featureCounts -f -O -p -s 2 -T 16 -F GTF -a ../vm25self.DEXSeq.gtf -o ../vm25selfallDEXSeqcount.txt test1.sort.bam test2.sort.bam
what is more, my data is paired-end, but commandline display The reads are assigned on the single-end mode

Process BAM file test1.sort.bam... ||
|| Strand specific : reversely stranded ||
|| Paired-end reads are included. ||
|| The reads are assigned on the single-end mode. ||
|| Total alignments : 106372557 ||
|| Successfully assigned alignments : 39673738 (37.3%) ||
|| Running time : 0.18 minutes

Thanks a lot

GTF to GFF transformation

I downloaded the Homo_sapiens.GRCh38.102.gtf from Ensembl, but I am getting this error for the GTF to GFF transformation.

python /opt/anaconda3/lib/R/library/DEXSeq/python_scripts/dexseq_prepare_annotation.py Homo_sapiens.GRCh38.102.gtf dexseq.gff

File "/opt/anaconda3/lib/R/library/DEXSeq/python_scripts/dexseq_prepare_annotation.py", line 129
raise ValueError, "Same name found on two chromosomes: %s, %s" % ( str(l[i]), str(l[i+1]) )

^
SyntaxError: invalid syntax

so how can I to fix it?
thank you!

Error in DEXSeqDataSetFromFeatureCounts

Hi,

I'm trying to use your code to run DEXSeq using featureCounts output, but when I run

dxd.fc <- DEXSeqDataSetFromFeatureCounts("counts_fCount.out", flattenedfile = "featurecountsgtf",sampleData = samp)

I get the following error message:

Error in DESeqDataSet(rse, design, ignoreRank = TRUE) :
some values in assay are not integers
In addition: Warning message:
Error in DESeqDataSet(rse, design, ignoreRank = TRUE) :
some values in assay are not integers

Could you please help me out with this as I'm not sure what the problem is.

Thanks,
Sneha

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.