pinellolab / dna-diffusion

🧬 Generative modeling of regulatory DNA sequences with diffusion probabilistic models 💨

Home Page: https://pinellolab.github.io/DNA-Diffusion/

License: Other

Languages: Jupyter Notebook 99.44%, Python 0.54%, Dockerfile 0.01%, Shell 0.01%, Makefile 0.01%
Topics: deep-learning, diffusion-models, genomics, regulatory-genomics, stable-diffusion, generative-model, diffusion-probabilistic-models

dna-diffusion's People

Contributors

1edv, aaronwtr, allcontributors[bot], cameronraysmith, hssn-20, ihabbendidi, jxilt, lucapinello, lucassilvaferreira, mansoldm, mateibejan1, meuleman, mihirneal, noahweber1, nz99, renovate[bot], ryams, sauravmaheshkar, ssenan, ttunja, zanussbaum


dna-diffusion's Issues

Create function motif_count

def motif_count(sequences, return_probability=False)

'sequences': list of sequences to scan
'return_probability': if True, scale the occurrences by the total number of input sequences
Return: pandas DataFrame with index 'motif_id_name' and value 'occurrences' (or 'probability') of each motif in the sequences
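A minimal sketch of what this function could look like. The motifs argument (a {motif_id_name: consensus} dict) and the exact-string scanning are assumptions, since the issue does not specify the motif database or scanning method; a real implementation would likely scan with PWMs (e.g. via FIMO or gimmemotifs):

```python
import re
import pandas as pd

def motif_count(sequences, motifs, return_probability=False):
    counts = {}
    for motif_id_name, consensus in motifs.items():
        # Count overlapping occurrences of the motif across all input
        # sequences (the lookahead allows overlapping matches).
        pattern = re.compile(f"(?={re.escape(consensus)})")
        counts[motif_id_name] = sum(len(pattern.findall(seq)) for seq in sequences)
    df = pd.DataFrame.from_dict(counts, orient="index", columns=["occurrences"])
    df.index.name = "motif_id_name"
    if return_probability:
        # Scale occurrences by the total number of input sequences.
        df = (df["occurrences"] / len(sequences)).to_frame("probability")
    return df
```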

Implement score-interpolation

GOAL: Implement the score interpolation described in Continuous diffusion for categorical data.

Please see Figure 3 below for the overall implementation requirements.

[Figure 3: score-interpolation architecture diagram; image not reproduced here]

Up-to-date notebook progress can be found at https://github.com/pinellolab/DNA-Diffusion/blob/score_interpolation_latest/notebooks/experiments/conditional_diffusion/dna_diff_baseline_conditional_UNET.ipynb; click the History button to review the actual code commits.

DONE:

  1. Added the DPM-Solver++ ODE solver block
  2. Added the Interpolate Embeddings block
  3. Added the reparameterization trick for p_losses()
  4. Fed the output of the Time Embedding stage into the LayerNorm inside the UNet model

TODO:

  • Enable the self_conditioning logic properly, without runtime dimension issues
  • Verify the Interpolate Embeddings block
  • Verify that the LearnedSinusoidalPosEmb class actually follows the theoretical concept behind the Random Fourier Embedding block
  • Enable the initial denoising inside p_losses() properly, accounting for the actual mean and std at denoising timestep t=0
  • Implement all aspects of the Input Embedding stage properly for the Interpolate Scores block
  • Enhance the denoising process with noise = sigma * epsilon, where sigma is a learned NN parameter, so that the noise is a deterministic function of the input and the parameters rather than being independent and randomly generated, as suggested by ChatGPT (see the sketch after this list)
  • Verify the integration of the Time Embedding stage into the UNet model
  • Enhance the training pipeline using the modified version of Tweedie's formula from the Soft Diffusion paper
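To make the learned-noise item above concrete, here is a minimal sketch of noise = sigma * epsilon with a learned, input-dependent sigma. SigmaNet and all names and shapes here are hypothetical, not code from the notebook:

```python
import torch
import torch.nn as nn

class SigmaNet(nn.Module):
    """Predicts a per-dimension noise scale from the embedded input."""

    def __init__(self, embed_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, embed_dim),
            nn.SiLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, x_embed: torch.Tensor) -> torch.Tensor:
        # exp keeps sigma strictly positive; x_embed: (batch, seq_len, embed_dim)
        return torch.exp(self.net(x_embed))

def add_learned_noise(x_embed: torch.Tensor, sigma_net: SigmaNet) -> torch.Tensor:
    sigma = sigma_net(x_embed)           # learned, input-dependent scale
    epsilon = torch.randn_like(x_embed)  # standard normal draw
    return x_embed + sigma * epsilon     # noise = sigma * epsilon
```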

Create a Data Loader Class with Pytorch Lightning

For the first version, make sure to use the latest version of the data provided by @meuleman here:

https://www.meuleman.org/research/synthseqs/#material

Specifically, we have the following datasets available:

training set: 160k sequences, 10k per NMF component (chr3-chrY, .csv.gz)
validation set: 16k sequences, 1k per NMF component (chr2 only, .csv.gz)
test set: 16k sequences, 1k per NMF component (chr1 only, .csv.gz)

Each of these contains the genomic locations (human genome assembly hg38, first 3 columns) of accessible genome elements, their majority NMF component (column: component; see the legend figure in the material linked above) as well as their 200bp nucleotide sequence (column: raw_sequence).

| Column | Example | Description |
| --- | --- | --- |
| seqname | chr16 | Chromosome |
| start | 68843660 | Start position |
| end | 68843880 | End position |
| DHS_width | 220 | Width of original DHS |
| summit | 68843790 | Position of center of mass of DHS |
| total_signal | 122.770678 | Sum of DNase-seq signal across biosamples |
| numsamples | 61 | Number of biosamples with the DHS |
| raw_sequence | GAGGCATTG… | 200bp extracted nucleotide sequence |
| component | 1 | Dominant DHS Vocabulary component |
| proportion | 0.767371514 | Proportion of NMF loadings in this component |
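A minimal sketch of such a data module, assuming the three .csv.gz files above have been downloaded locally; the file names, batch size, and one-hot encoding are placeholders to be adapted:

```python
import pandas as pd
import pytorch_lightning as pl
import torch
from torch.utils.data import DataLoader, Dataset

BASE_TO_IDX = {"A": 0, "C": 1, "G": 2, "T": 3}

class DHSDataset(Dataset):
    def __init__(self, csv_path: str):
        self.df = pd.read_csv(csv_path)

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        # One-hot encode the 200bp sequence into a (4, 200) float tensor.
        seq = torch.zeros(4, len(row["raw_sequence"]))
        for i, base in enumerate(row["raw_sequence"]):
            if base in BASE_TO_IDX:
                seq[BASE_TO_IDX[base], i] = 1.0
        return seq, int(row["component"])

class DHSDataModule(pl.LightningDataModule):
    def __init__(self, data_dir: str = ".", batch_size: int = 256):
        super().__init__()
        self.data_dir = data_dir
        self.batch_size = batch_size

    def setup(self, stage=None):
        # Placeholder file names; adapt to the actual downloads.
        self.train_set = DHSDataset(f"{self.data_dir}/train.csv.gz")
        self.val_set = DHSDataset(f"{self.data_dir}/validation.csv.gz")
        self.test_set = DHSDataset(f"{self.data_dir}/test.csv.gz")

    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size=self.batch_size, shuffle=True)

    def val_dataloader(self):
        return DataLoader(self.val_set, batch_size=self.batch_size)

    def test_dataloader(self):
        return DataLoader(self.test_set, batch_size=self.batch_size)
```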

Remove merged/unused branches

We would like to clean up branches that have been merged into the default branch while making sure we do not delete existing branches where work is ongoing. Please comment here with the name of the branch you are currently working on and a projected timeframe to merge.

Create a function called KL_divergence_motifs

def KL_divergence_motifs(original, generated)

'original': list or pandas Series with the original sequences used to train the model
'generated': list or pandas Series with the sequences generated by the model

Return: a 'score' (float), the Kullback-Leibler divergence between the two sets
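A minimal sketch, reusing the hypothetical motif_count helper from the motif_count issue above to turn each sequence set into a per-motif probability vector:

```python
import numpy as np

def KL_divergence_motifs(original, generated, motifs, eps=1e-10):
    # Per-motif probabilities for each set; pandas aligns the two Series
    # on the shared motif_id_name index.
    p = motif_count(list(original), motifs, return_probability=True)["probability"]
    q = motif_count(list(generated), motifs, return_probability=True)["probability"]
    # Normalize to proper distributions and add eps to avoid log(0).
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    # KL(P || Q) = sum_i p_i * log(p_i / q_i)
    return float(np.sum(p * np.log(p / q)))
```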

What is a contribution? Contribution file linked to the README

Discussed in #22

Originally posted by IhabBendidi October 17, 2022
An idea is to make a contribution file, to keep account of contributors.
But what is a contribution?

A code push? That would not count the think tanks that theorized the project, so it would be unfair. At the same time, what counts as a contributing discussion, and what doesn't? Lines can be blurry in some edge cases.

What are your thoughts?

The contribution list is going to become huge, and we need to keep proper track of it, so a dedicated file linked from the README would be better.

Use reformers to greatly speed up training time and reduce memory usage

Transformers (like the ones used in your model) have a memory and time complexity of O(n^2). Reformers (explained here) have a memory and time complexity of O(n log n) and, as such, require far less RAM and compute to train.

I strongly suggest using reformers, since they can replace transformers with almost no code changes and a massive benefit. For reference, replacing transformers with reformers in one of my projects took memory usage down from >10 GB to <1 GB and allowed training at 45 minutes per epoch on an 8-core CPU (I was testing it out before migrating to Colab).
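For illustration, a minimal sketch of the drop-in usage with the lucidrains reformer-pytorch package; the hyperparameters here are illustrative only, and the API is as documented in that package's README:

```python
import torch
from reformer_pytorch import Reformer  # pip install reformer-pytorch

# LSH attention gives O(n log n) memory/time instead of O(n^2).
model = Reformer(
    dim=512,      # embedding dimension
    depth=6,      # number of layers
    heads=8,      # attention heads
    causal=False  # bidirectional attention, as for a denoising model
)

x = torch.randn(1, 8192, 512)  # (batch, sequence length, dim)
y = model(x)                   # same shape out: (1, 8192, 512)
```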

Implement a sequence quality metric based on k-mer composition

Roughly, this is the idea: given a set of sequences (e.g. the training set or a generated set), we first write them to a temporary FASTA file; using the obtained file (e.g. sequences.fa), we call the kat hist function to get a histogram of k-mer counts. Here we can use k=7. Then we rescale the count obtained for each k-mer by the total sum to get k-mer probabilities. Now, assuming you have two k-mer probability vectors corresponding to the training sequences and the generated sequences, you can calculate the KL divergence using the PyTorch implementation that Cesar is also using. See: #14
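A sketch of this pipeline, with one simplification: it counts k-mers directly in Python rather than shelling out to kat hist, which should produce the same count histogram for short sequences:

```python
from collections import Counter

import torch
import torch.nn.functional as F

def kmer_distribution(sequences, k=7):
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i : i + k]] += 1
    return counts

def kmer_kl(train_seqs, generated_seqs, k=7, eps=1e-10):
    p_counts = kmer_distribution(train_seqs, k)
    q_counts = kmer_distribution(generated_seqs, k)
    # Use a shared k-mer vocabulary so the two vectors are aligned.
    vocab = sorted(set(p_counts) | set(q_counts))
    p = torch.tensor([p_counts[m] + eps for m in vocab], dtype=torch.float)
    q = torch.tensor([q_counts[m] + eps for m in vocab], dtype=torch.float)
    p, q = p / p.sum(), q / q.sum()
    # F.kl_div expects log-probabilities as input and probabilities as
    # target, and computes KL(target || input) with reduction="sum".
    return F.kl_div(q.log(), p, reduction="sum").item()
```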

Implement a sequence quality metric based on existing neural networks that can predict enhancer activity (BERT-based), expression (Enformer), or chromatin accessibility (BPNet)

@sg134 Can you help to explore sequences -> enhancer models?
We can use these enhancer classifiers as orthogonal metrics to evaluate our synthetic sequences.
Probably we can find tons of these classifiers, but here are some places to start your search:

bert enhancer
DeepEnhancer

What should be the focus:

  • Sequence length?
  • Do we need to retrain it? Can we retrain it on our 16 cell types?
  • Create a function to receive a sequence and report the probability of it being classified as an enhancer (see the sketch below).
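A hypothetical sketch of that interface; enhancer_model stands in for whichever pretrained classifier we settle on, and its loading and exact input format are left open:

```python
import torch

BASE_TO_IDX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(sequence: str) -> torch.Tensor:
    # (4, L) one-hot encoding; unknown bases (e.g. N) stay all-zero.
    x = torch.zeros(4, len(sequence))
    for i, base in enumerate(sequence):
        if base in BASE_TO_IDX:
            x[BASE_TO_IDX[base], i] = 1.0
    return x

def enhancer_probability(sequence: str, enhancer_model: torch.nn.Module) -> float:
    """Return the probability that `sequence` is classified as an enhancer."""
    with torch.no_grad():
        logit = enhancer_model(one_hot(sequence).unsqueeze(0))  # add batch dim
        return torch.sigmoid(logit).item()
```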

Develop a standardized PR ticket template

My interaction over at #50 made me think:

We may or may not have previously talked about setting a template for opening new pull requests. That interaction has a call for:

  1. A link to the ticket it's addressing
  2. A small explanation as to why (based on what, though?).

Maybe these two could be good starting points for a default ticket template? À la copy-and-paste, then fill in with your contribution. Additionally, maybe add whether your contribution is going into the codebase/experiments/other folder (or where in the codebase, once it grows), although this is probably covered by 1 already. Anything else you consider relevant or good/standard practice? A possible starting template is sketched below.
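For concreteness, one possible starting template (purely a suggestion to iterate on):

```markdown
## Related ticket
Closes #<issue number>

## What and why
<short explanation of the change and the reasoning behind it>

## Where it goes
- [ ] codebase (which module?)
- [ ] experiments
- [ ] other
```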

Design config file

Design a template config file which we can use to decide what parameters the train and sampler parsers have to receive and pass to pl.Trainer, and what the final parameters of the Diffusion and UNet classes are. A possible starting point is sketched below.
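One possible shape for such a template (all section names and values are illustrative, to be agreed on):

```yaml
trainer:        # parameters passed through to pl.Trainer
  max_epochs: 1000
  accelerator: gpu
  devices: 1
sampler:        # parameters received by the sampler parser
  num_samples: 1000
  timesteps: 50
diffusion:      # final Diffusion class parameters
  timesteps: 200
  beta_schedule: cosine
unet:           # final UNet class parameters
  dim: 200
  dim_mults: [1, 2, 4]
  channels: 4
```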
