I have been running flair in a conda environment created from your flair_env_conda.yam

diffExp outputs about flair HOT 8 CLOSED

brookslabucsc commented on August 11, 2024

diffExp outputs

Comments (8)

csoulette commented on August 11, 2024

Hi Sebastian,

Am I getting all the output files?

In order to get a "_*results.tsv" file from each of the analyses, DESeq2 and DRIMSeq must run to completion. So, judging from the output files you've listed, diffExp appears to be running to completion. Though I wouldn't expect a problem, I would check that each file is not empty.
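A minimal sketch of that emptiness check, using only the Python standard library. The helper name `find_empty_outputs` is made up for illustration and is not part of FLAIR; the file names are the ones mentioned in this thread, so substitute whatever diffExp wrote for your run.

```python
import os

def find_empty_outputs(paths):
    """Return the paths that are missing or contain zero bytes."""
    problems = []
    for path in paths:
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            problems.append(path)
    return problems

if __name__ == "__main__":
    # File names taken from this thread; adjust to your own output directory.
    outputs = [
        "dge_CTRL_v_treatment_deseq2_results.tsv",
        "diu_Treatment_v_CTRL_drimseq2_results.tsv",
    ]
    for path in find_empty_outputs(outputs):
        print("missing or empty:", path)
```

A file that holds only a header line would still pass this check, so it is worth opening each table once as well.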

Or is the error in the "dge_stderr.txt" file truncating/abrogating the output?

The dge_stderr.txt file is always created for each run of diffExp. Its presence does not necessarily mean that the run encountered a fatal error that caused the module to exit. This file serves more as a diagnostic for why a particular table or plot was not generated. From your output, it seems like there may have been an issue with the isoform count input for DRIMSeq.

How is the "_shrinkage" files different from the corresponding .tsv files?

The "_shrinkage" tables contain all of the same information as the non-shrinkage tables, except for the logFoldChange column. For the "_shrinkage" table, a shrinkage estimator was applied (part of the DESeq2 suggested pipeline) to better rank genes. See the section "Log fold change shrinkage for visualization and ranking" in the DESeq2 vignette. To visualize the differences in the logFoldChange calculations, you can take a look at the MA plots generated by the diffExp module.

Should replicates of the same condition be denoted by a different batch descriptor (in flair quantify) or is the variation between replicates taken into account by flair diffExp?

The "should" is entirely up to the user. The diffExp module currently does not assess the input data for batch-specific variability that could occur within a condition. I would suggest using the PCA plots generated by the diffExp module to assess whether or not there is unwanted variability in your experiment. For a more thorough check, there are many Bioconductor tools for assessing batch effects hiding in your data (e.g. https://bioconductor.org/packages/release/bioc/html/variancePartition.html ). If there are known batch effects in your data, then you can include this information at the quantify step ( https://github.com/BrooksLabUCSC/flair#quantify ).

Finally: can you elaborate on the '-e' flag in flair diffExp (default=10) and how this influences the stringency of the analysis?

We do this filtering to remove low expression counts, which helps to increase the speed of the diffExp run (the underlying DESeq2 algorithm also does expression filtering to some extent). Specifically, we remove genes/isoforms where the minimum expression count is less than '-e'. The rationale here is that high expression counts have higher power for differential expression (DE) detection than low expression counts. Therefore, we simply prefilter low counts that would otherwise test negative for DE. In any case, running diffExp with '-e 1' or any other value < 10 should not strongly influence your run. Low counts with high variability will either be filtered out by DESeq2 or called negative for DE.
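To make the effect of '-e' concrete, here is a rough sketch of this kind of prefilter. It is not the code diffExp runs: it assumes "minimum expression count" means the smallest count for a feature across all samples, and the function name and the counts are invented for illustration.

```python
def filter_low_counts(rows, min_count=10):
    """Keep (feature_id, counts) pairs whose smallest per-sample count
    is at least min_count.

    Assumption: the cutoff is applied to the minimum across samples.
    """
    return [(fid, counts) for fid, counts in rows if min(counts) >= min_count]

# Hypothetical counts for three features across four samples.
rows = [
    ("geneA", [50, 60, 55, 70]),
    ("geneB", [2, 3, 12, 1]),
    ("geneC", [10, 10, 11, 15]),
]

kept_default = filter_low_counts(rows)               # like the default cutoff of 10
kept_lenient = filter_low_counts(rows, min_count=1)  # like running with '-e 1'
print([fid for fid, _ in kept_default])
print([fid for fid, _ in kept_lenient])
```

Under this reading, lowering the cutoff only lets additional low-count features through to DESeq2, which fits the point above that such features tend to be filtered or called negative anyway.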

Let me know if I can elaborate on anything else, thanks!

-CMS

drc111 commented on August 11, 2024

Hi Cameron,

Thank you for your comprehensive - and fast! - answers.

Just a follow-up question to 1/2):

If the 'diu_Treatment_v_CTRL_drimseq2_results.tsv' file is created after DRIMSeq completion, and it contains all the expected information, can I assume that the DRIMSeq analysis was successful despite the 'dge_stderr.txt' output?

Best,
Sebastian

csoulette commented on August 11, 2024

Hi Sebastian,

We haven't encountered the Pandas warning your diffExp run is throwing, and usually errors with converting data frames will either produce an empty table or cause a fatal error. One way to QC the diuresults.tsv output table would be to look at how many tests DRIMSeq actually performed (this would be _the number of lines from the diu.tsv table_ that have a value greater than 0 in the last column). You can compare this number to the total number of isoforms that fell above your expression cutoff (this would be the number of lines from the "filtered_iso_counts_ds2.tsv" file generated when running diffExp). The two numbers won't be exactly the same, but the number of tests performed by DRIMSeq is expected to be equal to or greater than the number of isoforms that fell above your expression cutoff.
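The comparison described above could be sketched as follows. The function names are hypothetical, the sample lines are invented, and whether either table starts with a header line is an assumption you should verify against your own files. Both functions accept any iterable of text lines, so an open file handle works in place of the lists.

```python
import csv

def count_tested(lines):
    """Count rows whose last tab-separated column is a number greater than 0.

    Rows whose last column is not numeric (a header, or 'NA') are skipped.
    """
    tested = 0
    for fields in csv.reader(lines, delimiter="\t"):
        if not fields:
            continue
        try:
            value = float(fields[-1])
        except ValueError:
            continue
        if value > 0:
            tested += 1
    return tested

def count_features(lines, has_header=True):
    """Count non-blank lines, optionally excluding one header line."""
    n = sum(1 for line in lines if line.strip())
    return max(n - 1, 0) if has_header else n

# Invented stand-ins for the DRIMSeq results table and the filtered counts file.
diu_lines = ["id\tpval\tpadj\n", "iso1\t0.01\t0.04\n", "iso2\tNA\tNA\n", "iso3\t0.5\t0\n"]
count_lines = ["ids\ts1\ts2\n", "iso1\t20\t31\n", "iso2\t15\t12\n"]

print("tests performed:", count_tested(diu_lines))
print("isoforms above cutoff:", count_features(count_lines))
```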

-CMS

drc111 commented on August 11, 2024

Sorry to bother you once again, however, upon inspecting the tsv-output files from DESeq2, I noticed that all log2FoldChange values are > 0, suggesting that genes are only upregulated
(file: dge_CTRL_v_treatment_deseq2_results.tsv).

Furthermore, the log2FoldChange values approximately correspond to the mean read count in the "filtered_gene_counts_ds2.tsv" file.

But looking at the MA-plot it is evident that some log2FoldChange should be < 0.

Do you have any suggestions as to what is going wrong?

I have also provided the "filtered_gene_counts_ds2.tsv" (produced following flair diffExp) as well as the "reads_manifest" and "counts_matrix" provided to/from flair quantify, respectively.

DESeq2_files.zip

Thank you for your time!

Sebastian

csoulette commented on August 11, 2024

Hey Sebastian,

Thanks for providing your input and output files. I've taken a look at the MA plot and the DGE DESeq2 output table and I think there may be confusion regarding the column indexing. The way in which we print out the results from DESeq2 is such that there is no column header for the first column (the feature_id column). In other words, the second column (the first column with numeric values) corresponds to the baseMean column header, and the third column corresponds to the log2FoldChange column header. Hope this makes things clear!
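If you parse these tables with a script, one way to avoid the off-by-one reading is to restore the missing label before pairing names with values. The sketch below is only an illustration: the function name is made up, and it assumes the header either has one field fewer than the data rows or starts with an empty field. The sample rows are invented.

```python
def read_headerless_id_table(lines, id_name="feature_id"):
    """Parse a tab-separated table whose first column has no header label.

    Returns one dict per data row, with the id column named id_name.
    """
    rows = [line.rstrip("\r\n").split("\t") for line in lines if line.strip()]
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    if header and header[0] == "":
        # The header begins with an empty field: give it a name.
        header = [id_name] + header[1:]
    elif data and len(header) == len(data[0]) - 1:
        # The header is one field short: prepend a name for the id column.
        header = [id_name] + header
    return [dict(zip(header, fields)) for fields in data]

# Invented example in the layout described above.
example = [
    "baseMean\tlog2FoldChange\tpadj\n",
    "geneA\t120.5\t-1.8\t0.001\n",
]
for row in read_headerless_id_table(example):
    print(row["feature_id"], row["baseMean"], row["log2FoldChange"])
```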

-CMS

drc111 commented on August 11, 2024

Thank you very much for this clarification, you are absolutely right!

Lastly, I am also not sure about whether log2foldchange is calculated for CTRL/treatment or treatment/CTRL in this case.

Can this be derived from the file names or the reads_manifest.tsv?

Thank you!

Sebastian

csoulette commented on August 11, 2024

Hi Sebastian,

More than happy to clarify!

The direction of the log2FoldChange should be described by the naming convention of your output tables. For example, "dge_wt_v_mt_deseq2_results.tsv" should be log2(WT/MT). You can check this by comparing "normalized" gene counts for the gene with the highest/lowest log2FoldChange. I'm not sure if there is a way to manipulate the file structure of your input files so that DESeq2 outputs the fold change in the direction that you want, but it is definitely something we will try to fix.
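A quick sanity check along those lines might look like the sketch below. The function names, the pseudocount, and the counts are all illustrative assumptions. DESeq2's estimate comes from a fitted model, so only the sign of this naive ratio should be compared with the reported value, not its magnitude.

```python
import math

def naive_log2_ratio(numerator_counts, denominator_counts, pseudocount=1.0):
    """log2 of (mean numerator + pseudocount) / (mean denominator + pseudocount)."""
    mean_num = sum(numerator_counts) / len(numerator_counts)
    mean_den = sum(denominator_counts) / len(denominator_counts)
    return math.log2((mean_num + pseudocount) / (mean_den + pseudocount))

def direction_matches(reported_lfc, numerator_counts, denominator_counts):
    """True if the reported fold change has the same sign as the naive ratio."""
    naive = naive_log2_ratio(numerator_counts, denominator_counts)
    return (reported_lfc > 0) == (naive > 0)

# Invented normalized counts for the gene with the largest reported log2FoldChange.
wt_counts = [400.0, 380.0]
mt_counts = [100.0, 90.0]
reported = 2.3  # hypothetical value from a dge_wt_v_mt table

# True would be consistent with the table reporting log2(WT/MT).
print(direction_matches(reported, wt_counts, mt_counts))
```

If the check comes back False for the most extreme genes, the table is more likely reporting the ratio in the opposite direction.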

Thanks~

-CMS

drc111 commented on August 11, 2024

Thanks for your answers and your time!

Sebastian
