center-for-health-data-science / bulkdgd

Code, documentation, and tutorial for the DGD model trained on bulk RNA-Seq data.

License: GNU General Public License v3.0

bulkdgd's Introduction


bulkDGD is a Python package providing an interface to use the Deep Generative Decoder (DGD) developed by Schuster and Krogh (Schuster and Krogh, 2023) to model the gene expression of healthy human tissues from bulk RNA-Seq data.

The first application of the model to bulk RNA-Seq data is presented in the work of Prada-Luengo, Schuster, Liang, and coworkers (Prada-Luengo, Schuster, Liang, et al., 2023).

  • Documentation: bulkDGD's documentation can be found here.
  • Bug reports: please report any bugs or problems you encounter with bulkDGD in the dedicated issues section on GitHub.

License

bulkDGD is freely available under the terms of the GNU General Public License (Version 3, 29 June 2007).

References

(Schuster and Krogh, 2023) Schuster, Viktoria, and Anders Krogh. "The Deep Generative Decoder: MAP estimation of representations improves modelling of single-cell RNA data." Bioinformatics 39.9 (2023): btad497.

(Prada-Luengo, Schuster, Liang, et al., 2023) Prada-Luengo, Iñigo, et al. "N-of-one differential gene expression without control samples using a deep generative model." Genome Biology 24.1 (2023): 263.

bulkdgd's Issues

Feature: get_recount3_data - query based filtering for SRA data

Feature:
SRA-attribute-specific, query-based filtering when using the dgd_get_recount3_data CLI tool to download SRA data from recount3.

Explanation:
When using dgd_get_recount3_data.py to download SRA data, the SRA-specific attributes are currently gathered in a single column. This makes the attributes unavailable for query-based filtering in dgd_get_recount3_data.py.

Solution:
The SRA-specific attributes should be parsed and added as columns to the metadata file, and their names should be added to the metadata_fields file.
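
A minimal sketch of the parsing step described above, not the package's implementation. It assumes the packed attributes live in a column named `sra.sample_attributes`, with pairs separated by `|` and each key separated from its value by `;;`. The column name, the separators, and both function names are assumptions made for illustration.

```python
def parse_sra_attributes(raw, pair_sep="|", key_value_sep=";;"):
    """Split one packed attribute string into a {name: value} dict.

    The separators are assumptions about the packed format.
    """
    attributes = {}
    if not raw:
        return attributes
    for pair in raw.split(pair_sep):
        key, sep, value = pair.partition(key_value_sep)
        if not sep:
            # Skip fragments that have no key/value separator.
            continue
        attributes[key.strip().replace(" ", "_")] = value.strip()
    return attributes


def expand_attribute_column(rows, column="sra.sample_attributes"):
    """Return (expanded rows, sorted list of the new column names).

    `rows` is a list of dicts, one per sample; `column` is a
    hypothetical name for the packed attribute column.
    """
    parsed = [parse_sra_attributes(row.get(column, "")) for row in rows]
    new_columns = sorted({name for attrs in parsed for name in attrs})
    expanded = []
    for row, attrs in zip(rows, parsed):
        new_row = dict(row)
        for name in new_columns:
            # Samples lacking an attribute get an empty value.
            new_row[name] = attrs.get(name, "")
        expanded.append(new_row)
    return expanded, new_columns
```

The returned list of new column names is what would be appended to the metadata_fields file, so that the new columns become valid targets for a query string.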

adding optimizations

hi there!
I'm running DGD on my own data (not from tutorials), and I'm getting really high loss at the end of two optimizations:

INFO:bulkDGD.core.model:Epoch 50: loss 14.552, epoch CPU time 0.780 s, backward step CPU time 0.676 s, epoch wall clock time 0.563 s, backward step wall clock time 0.469 s.

I am really new to machine learning so please bear with me. Should I just be adding more optimizations in the format of the provided yaml files?

Feature: Add a function for bulk download of recount3 data while allowing query filtering.

Main functionality:
The function should download data in bulk, using a CSV file as input.

Issue reason:
This feature would make downloading data from Recount3 in bulk possible while applying a specific query_string for metadata-based and attribute-based filtering on each dataset.

Solution:
This idea is for a wrapper around the recount3 data getter CLI tool, bulkDGD/execs/dgd_get_recount3_data.py.

  • The wrapper takes a CSV file and a download path as input.
  • The CSV file should hold the data specific inputs for dgd_get_recount3_data.py and apply these to the data getter function along with the download path.

Comment:
First, this tool should be available as a function under the recount3 module as an extension to #3.
Later, it should be made available as a CLI tool.
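
The wrapper idea could be sketched roughly as follows. This is an illustration under stated assumptions, not the proposed implementation: `getter` stands in for whatever function the recount3 module exposes, `save_dir` is a hypothetical keyword for the download path, and the CSV column names are assumed to match the getter's keyword arguments.

```python
import csv
from pathlib import Path


def bulk_get_recount3_data(csv_path, download_dir, getter):
    """Call `getter` once per CSV row and collect the results.

    `getter` is a placeholder for the single-dataset data getter;
    its real name and signature are not taken from the package.
    """
    download_dir = Path(download_dir)
    download_dir.mkdir(parents=True, exist_ok=True)
    results = []
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            # Drop empty cells so the getter can use its own defaults,
            # e.g. when a row has no query string.
            options = {key: value for key, value in row.items() if value}
            results.append(getter(save_dir=download_dir, **options))
    return results
```

Passing the getter in as an argument keeps the sketch independent of the actual function name and makes the row-by-row logic easy to test with a stub.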

Memory use and issues with fold change output

Hi,
We have two questions / issues at hand.

We have been using bulkDGD on another dataset with a fairly large sample size of approx. 10,000. When attempting to run bulkDGD on all samples at once, we quickly run out of memory (> 200 GB). We have instead processed fewer samples at a time, which requires significantly less memory but takes longer. Is this to be expected? And is our approach compatible with your code?

The other issue we are dealing with is that the fold change output from the DEA is oddly distributed. It seems to have some unexpected cut-offs. Do you have a suggestion as to what could have gone wrong? The issue is illustrated in the figure below, which compares the log fold changes computed using the mean expression of our entire dataset with those computed using bulkDGD.

[Figure: log fold changes computed using the mean expression of the entire dataset vs. using bulkDGD]
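
The approach from the first question, processing fewer samples at a time, can be sketched generically as below. It assumes each sample can be processed independently of the others, which is not confirmed here for bulkDGD. `process_batch` is a hypothetical callable standing in for whatever per-batch computation is run.

```python
def iter_batches(samples, batch_size):
    """Yield consecutive slices of `samples` with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


def process_in_batches(samples, batch_size, process_batch):
    """Apply `process_batch` to each slice and concatenate the results.

    Only one batch of inputs is handed to `process_batch` at a time,
    so peak memory depends on `batch_size` rather than on the full
    number of samples.
    """
    results = []
    for batch in iter_batches(samples, batch_size):
        results.extend(process_batch(batch))
    return results
```

A smaller `batch_size` lowers peak memory at the cost of more calls, which matches the trade-off between memory use and run time described above.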

FC value

In the process of using bulkDGD, I encountered two issues:

  1. In the file bulkDGD/analysis/dea.py, specifically in the function get_log2_fold_changes at line 687, the command is: "torch.log2((pred_means + 1e-6) / (obs_counts + 1e-6))". This command indicates that the denominator of the FC (Fold Change) is the true gene expression value, while the numerator is the predicted expression. This is opposite to the conventional way of calculating FC.
  2. Calculating |log2FC| for the tutorial data and for the recount3 BRCA expression counts often results in values in the range of 10-30, whereas, e.g., the log2FC for EGFR is mostly less than 10. The log2FC values are smaller than the log2FC for EGFR in Figure 4D-F of the article (Prada‑Luengo et al., Genome Biology).
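
To make the first point concrete, the sketch below evaluates the quoted expression for a single gene in both orientations, using plain `math` instead of `torch`. The function names are made up for illustration; only the formula and the `1e-6` pseudocount come from the line quoted above.

```python
import math

PSEUDOCOUNT = 1e-6  # same constant as in the quoted expression


def log2_fc_pred_over_obs(pred_mean, obs_count, eps=PSEUDOCOUNT):
    """Orientation quoted in the issue: predicted mean over observed count."""
    return math.log2((pred_mean + eps) / (obs_count + eps))


def log2_fc_obs_over_pred(pred_mean, obs_count, eps=PSEUDOCOUNT):
    """Opposite orientation: observed count over predicted mean."""
    return math.log2((obs_count + eps) / (pred_mean + eps))
```

For a predicted mean of 10 and an observed count of 40, the first function returns approximately -2 and the second approximately +2, so the two orientations differ only in sign. With an observed count of 0, the pseudocount dominates the denominator and the first function returns a value above 23. That falls within the 10-30 range mentioned in the second point, though whether zero counts account for the reported values is not established here.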
