center-for-health-data-science / bulkdgd

Code, documentation, and tutorial for the DGD model trained on bulk RNA-Seq data.

License: GNU General Public License v3.0

Python 100.00%

deep-generative-model deep-generative-modelling deep-generative-models neural-network rna-seq rna-seq-analysis rnaseq rnaseq-analysis

bulkdgd's Introduction

bulkDGD is a Python package providing an interface to use the Deep Generative Decoder (DGD) developed by Schuster and Krogh (Schuster and Krogh, 2023) to model the gene expression of healthy human tissues from bulk RNA-Seq data.

The first application of the model to bulk RNA-Seq data is presented in the work of Prada-Luengo, Schuster, Liang, and coworkers (Prada-Luengo, Schuster, Liang, et al., 2023).

Documentation: bulkDGD's documentation can be found here.
Bug reports: please report any bugs or problems you encounter with bulkDGD in the dedicated issues section on GitHub.

License

bulkDGD is freely available under the terms of the GNU General Public License (Version 3, 29 June 2007).

References

(Schuster and Krogh, 2023) Schuster, Viktoria, and Anders Krogh. "The Deep Generative Decoder: MAP estimation of representations improves modelling of single-cell RNA data." Bioinformatics 39.9 (2023): btad497.

(Prada-Luengo, Schuster, Liang, et al., 2023) Prada-Luengo, Iñigo, et al. "N-of-one differential gene expression without control samples using a deep generative model." Genome Biology 24.1 (2023): 263.

bulkdgd's Issues

Feature: get_recount3_data - query based filtering for SRA data

Feature:
Query-based filtering on SRA-specific attributes when using the dgd_get_recount3_data CLI tool to download SRA data from recount3.

Explanation:
When dgd_get_recount3_data.py downloads SRA data, the SRA-specific attributes are currently gathered in a single column. This makes them unavailable for query-based filtering.

Solution:
The SRA-specific attributes should be parsed and added both as columns in the metadata file and as entries in the metadata_fields file.
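
A minimal sketch of the parsing step proposed above, using only the standard library. It assumes the attributes arrive packed in one string per sample, with "|" between pairs and ";;" between name and value, in a column called sra.sample_attributes. The column name, the separators, and both function names are assumptions made for illustration; they are not taken from the bulkDGD code.

```python
def parse_sra_attributes(raw, pair_sep="|", key_value_sep=";;"):
    """Split one packed attribute string into a {name: value} dict."""
    attributes = {}
    if not raw:
        return attributes
    for pair in raw.split(pair_sep):
        if key_value_sep not in pair:
            # Skip fragments that do not look like "name;;value".
            continue
        key, value = pair.split(key_value_sep, 1)
        # Replace spaces so the name can be used in a query expression.
        attributes[key.strip().replace(" ", "_")] = value.strip()
    return attributes


def expand_attributes(rows, column="sra.sample_attributes"):
    """Add one column per attribute to every row (a list of dicts).

    Returns the expanded rows and the sorted list of new field names,
    which is what would be appended to a metadata-fields listing.
    """
    parsed = [parse_sra_attributes(row.get(column, "")) for row in rows]
    new_fields = sorted({name for attrs in parsed for name in attrs})
    expanded = []
    for row, attrs in zip(rows, parsed):
        new_row = dict(row)
        for field in new_fields:
            # Samples lacking an attribute get an empty string.
            new_row[field] = attrs.get(field, "")
        expanded.append(new_row)
    return expanded, new_fields
```

Once each attribute has its own column, a query string can refer to it by name like any other metadata field.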

adding optimizations

hi there!
I'm running DGD on my own data (not from tutorials), and I'm getting really high loss at the end of two optimizations:

INFO:bulkDGD.core.model:Epoch 50: loss 14.552, epoch CPU time 0.780 s, backward step CPU time 0.676 s, epoch wall clock time 0.563 s, backward step wall clock time 0.469 s.

I am really new to machine learning so please bear with me. Should I just be adding more optimizations in the format of the provided yaml files?

Feature: Add a function for bulk download of recount3 data while allowing query filtering.

Main functionality:
The function should download data in bulk using a CSV file as input.

Issue reason:
This feature would make downloading data from Recount3 in bulk possible while applying a specific query_string for metadata-based and attribute-based filtering on each dataset.

Solution:
This idea is for a wrapper around the recount3 data getter CLI tool, bulkDGD/execs/dgd_get_recount3_data.py.

The wrapper takes a CSV file and a download path as input.
The CSV file should hold the dataset-specific inputs for dgd_get_recount3_data.py; the wrapper passes these to the data getter function along with the download path.

Comment:
First, this tool should be available as a function under the recount3 module as an extension to #3 .
Later, it should be made available as a CLI tool.
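
The wrapper described in this issue could look roughly like the sketch below. The fetch argument is a hypothetical stand-in for whatever function ends up wrapping the recount3 data getter, and the CSV column names are whatever keyword arguments that function accepts; neither is taken from the bulkDGD code.

```python
import csv


def bulk_download(csv_file, download_dir, fetch):
    """Call `fetch` once per CSV row and collect a status per row.

    csv_file     -- an open text file with a header line
    download_dir -- directory passed unchanged to every call
    fetch        -- hypothetical callable: fetch(download_dir=..., **row)
    """
    results = []
    reader = csv.DictReader(csv_file)
    # The header is line 1, so data rows start at line 2.
    for line_number, row in enumerate(reader, start=2):
        # Drop empty cells so optional inputs fall back to fetch's defaults.
        options = {
            key: value
            for key, value in row.items()
            if key is not None and value not in (None, "")
        }
        try:
            fetch(download_dir=download_dir, **options)
            results.append((line_number, "ok"))
        except Exception as error:
            # One failing dataset should not abort the remaining downloads.
            results.append((line_number, f"failed: {error}"))
    return results
```

Collecting a status per row, rather than stopping at the first error, lets a long bulk run finish and report which datasets need to be retried.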

Memory use and issues with fold change output

Hi,
We have two questions / issues at hand.

We have been using bulkDGD on another dataset with a fairly large sample size of approx. 10,000. When attempting to run bulkDGD on all samples at once, we quickly run out of memory (> 200 GB). We have instead attempted to process fewer samples at a time, which requires significantly less memory but takes longer. Is this to be expected? And is our approach compatible with your code?
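
For reference, the batch-wise approach described above can be sketched as follows. process_batch is a hypothetical placeholder for whatever per-batch step is run, and the sketch assumes each sample can be processed independently of the others, which is exactly the point the question asks about.

```python
def iter_batches(samples, batch_size):
    """Yield consecutive slices of `samples` with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


def process_in_batches(samples, batch_size, process_batch):
    """Apply `process_batch` to each batch and concatenate the results.

    Only one batch is handed to `process_batch` at a time, so peak memory
    depends on `batch_size` rather than on the total number of samples.
    """
    results = []
    for batch in iter_batches(samples, batch_size):
        results.extend(process_batch(batch))
    return results
```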

The other issue we are dealing with is that the fold change output from the DEA is oddly distributed. It seems to have some weird cut-offs. Do you have a suggestion as to what could have gone wrong? The issue is illustrated in the figure below, where we compare the log fold changes computed using the mean expression of our entire dataset with those computed using bulkDGD.

FC value

In the process of using bulkDGD, I encountered two issues:

In the file bulkDGD/analysis/dea.py, specifically in the function get_log2_fold_changes at line 687, the command is: "torch.log2((pred_means + 1e-6) / (obs_counts + 1e-6))". This command indicates that the denominator of the FC (Fold Change) is the true gene expression value, while the numerator is the predicted expression. This is opposite to the conventional way of calculating FC.
Calculating |log2FC| for the tutorials and the recount3 BRCA expression counts often results in a range of 10-30, e.g., the log2FC for EGFR is mostly less than 10. The log2FC values are smaller than the log2FC for EGFR in Figure 4D-F of the article (Prada-Luengo et al., Genome Biology).
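
To make the first point concrete, the sketch below evaluates the quoted expression on plain floats next to the observed-over-predicted ratio. The function names are made up for illustration; only the formula and the 1e-6 pseudocount come from the line quoted above.

```python
import math

EPS = 1e-6  # pseudocount from the quoted expression


def log2_fc_pred_over_obs(pred_mean, obs_count, eps=EPS):
    """Ratio as in the quoted line: predicted mean over observed count."""
    return math.log2((pred_mean + eps) / (obs_count + eps))


def log2_fc_obs_over_pred(pred_mean, obs_count, eps=EPS):
    """The opposite orientation: observed count over predicted mean."""
    return math.log2((obs_count + eps) / (pred_mean + eps))
```

The two orientations differ only in sign: for a predicted mean of 100 and an observed count of 400, the first gives about -2 and the second about +2. With a pseudocount this small, a zero count on one side dominates the ratio; for a predicted mean of 1000 and an observed count of 0, the magnitude is about log2(1e9), roughly 29.9. This is arithmetic on the quoted formula only and does not establish what causes the values reported above.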
