Implement a sequence quality metric based on existing neural networks that can predict enhancer activity (Bert based), expression (Enformer) or chromatin accessibility (BPnet)

pinellolab commented on June 10, 2024 1

Implement a sequence quality metric based on existing neural networks that can predict enhancer activity (Bert based), expression (Enformer) or chromatin accessibility (BPnet)

from dna-diffusion.

Comments (32)

meuleman commented on June 10, 2024 1

So in principle these are all available from the ENCODE project portal, e.g. see here:
https://www.encodeproject.org/publication-data/ENCSR686ZBA/
However, I now noticed that these are just the peak files and I'm not sure there is a direct route to get the full set of 733 bigWig files (I'm asking around at the moment, will report back when I hear more).

So I took a less direct route and used the project's 733 biosample metadata spreadsheet to identify bigWig files on the ENCODE portal. Mostly because of replicates, it's not straightforward to find all 733 files, but if you mainly care about getting the highest-quality bigWig for each cell type/state, you can use this list of 515 bigWig files:
https://www.dropbox.com/s/qdp0mzua2brgd33/metadata_733_515_WM20221026.tsv?dl=0
The second column corresponds to the ENCODE experiment ID, which you can look up in the metadata spreadsheet in case you want more information on the biosample data.

Hope this helps!
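
A minimal sketch of pulling the experiment IDs out of the second column of such a tab-separated file, using only the standard library. Whether the real file has a header row is not stated above, so `has_header` is an assumption, and the toy rows below are made up for illustration.

```python
import csv
import io


def experiment_ids(tsv_text, has_header=True):
    """Return the values found in the second column of a tab-separated table."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    rows = [row for row in reader if row]  # skip completely empty lines
    if has_header:
        rows = rows[1:]
    # Rows with fewer than two columns are ignored rather than raising.
    return [row[1] for row in rows if len(row) > 1]


# Toy stand-in for the metadata file; column names and values are invented.
example = "file_id\texperiment_id\nFILE_A\tEXP_1\nFILE_B\tEXP_2\n"
ids = experiment_ids(example)
```

For the real file, the text would be read from disk first and passed in the same way.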

aaronwtr commented on June 10, 2024 1

@younwoochoi I will work out a concrete plan to tackle the Enformer validation. That will give us some concrete items we can work on.

younwoochoi commented on June 10, 2024 1

To-do Enformer evaluation

Experimental data:

dataloader.EnformerDataloaderDNase Load DNase data into DataFrame (Assignee(s): @aaronwtr)

utils.sig_to_chr Map signal onto chr1 (Assignee(s): @aaronwtr)

Enformer:

dataloader.EnformerDataloaderDNase Get all genes on chr1 and collect TSS genomic coordinates (Assignee(s): @aaronwtr)

dataloader.EnformerDataloaderDNase Extend gene scope to 196,608 bp centered around the TSS. Note: you can use the pre-existing function from dataloader.EnformerDataloaderABC (Assignee(s): @aaronwtr)

dataloader.EnformerDataloaderDNase Get one-hot encoded sequence corresponding to genomic coordinates. Note: you can use the pre-existing functions from dataloader.EnformerDataloaderABC (Assignee(s): @aaronwtr)

inference.EnformerModel Run inference for the experimental DNase data (Assignee(s): @aaronwtr)

Evaluation:

eval.EnformerEvaluation Implement Poisson loss and Pearson correlation coefficient (PCC) akin to the lucidrains implementation (Assignee(s): @younwoochoi)

eval.EnformerEvaluation Run and assess comparison between experiment and Enformer (Assignee(s): @aaronwtr, @younwoochoi)

I can help with the two evaluation tasks. I already implemented the metrics for the first task and it's ready to use.
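
For reference, a minimal pure-Python sketch of the two metrics named in the list above. This is not the code referred to in this comment, nor a copy of the lucidrains implementation; it only follows the standard definitions (Poisson negative log-likelihood without the constant log(target!) term, and the Pearson correlation coefficient). The `eps` guard is an assumption added to avoid `log(0)`.

```python
import math


def poisson_loss(pred, target, eps=1e-20):
    """Mean Poisson negative log-likelihood, omitting the log(target!) constant."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and of equal length")
    total = sum(p - t * math.log(max(p, eps)) for p, t in zip(pred, target))
    return total / len(pred)


def pearson_corr(x, y):
    """Pearson correlation coefficient; returns NaN if either input is constant."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)
```

In practice both would be applied per track, with `pred` holding the Enformer output and `target` the binned experimental signal.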

aaronwtr commented on June 10, 2024 1

Updated to-do:

To-do Enformer evaluation

Experimental data (@aaronwtr , @mihirneal):

dataloader.EnformerDataloaderDNase Load DNase data into DataFrame (Assignee(s): @aaronwtr)
utils.sig_to_chr Map signal onto chr1 in 128 bins (Assignee(s): @aaronwtr)

Enformer (@aaronwtr, Gabriel, @younwoochoi):

dataloader.EnformerDataloaderDNase Get all genes on chr1 and collect TSS genomic coordinates (Assignee(s): @aaronwtr)
dataloader.EnformerDataloaderDNase Extend gene scope to 196,608 bp centered around the TSS. Note: you can use the pre-existing function from dataloader.EnformerDataloaderABC (Assignee(s): @aaronwtr)
dataloader.EnformerDataloaderDNase Get one-hot encoded sequence corresponding to genomic coordinates. Note: you can use the pre-existing functions from dataloader.EnformerDataloaderABC (Assignee(s): @aaronwtr)
inference.EnformerModel Run inference for the experimental DNase data (Assignee(s): @aaronwtr)

Evaluation:

eval.EnformerEvaluation Implement Poisson loss and PCC akin to the lucidrains implementation (Assignee(s): @aaronwtr, @younwoochoi)
eval.EnformerEvaluation Run and assess comparison between experiment and Enformer (Assignee(s): @aaronwtr, @younwoochoi)
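
The window-extension and binning steps in the list above can be sketched as follows. This is an illustration only, not the project's `dataloader` or `utils.sig_to_chr` code. It assumes that "128 bins" refers to 128 bp wide bins (Enformer's output resolution), that coordinates are 0-based half-open, and that the per-base signal is averaged within each bin; the function names are hypothetical.

```python
SEQ_LEN = 196_608  # Enformer input length in bp
BIN_SIZE = 128     # assumed bin width in bp


def tss_window(tss, seq_len=SEQ_LEN):
    """Return the 0-based half-open [start, end) window of seq_len bp centred on tss.

    The start can be negative for genes close to the chromosome start;
    clipping or padding is left to the caller.
    """
    start = tss - seq_len // 2
    return start, start + seq_len


def bin_signal(signal, bin_size=BIN_SIZE):
    """Average a per-base signal over consecutive bins; a trailing partial bin is dropped."""
    n_bins = len(signal) // bin_size
    return [
        sum(signal[i * bin_size:(i + 1) * bin_size]) / bin_size
        for i in range(n_bins)
    ]
```

A full 196,608 bp window gives 1,536 bins of 128 bp. As described in the Enformer paper, the model predicts only the central 896 of these, so the experimental track would need the same cropping before the two are compared.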

aaronwtr commented on June 10, 2024

Hi @LucasSilvaFerreira and @sg134, I'd like to help out with this issue! Let me see if I understand correctly, and please correct me if I am misunderstanding.

The idea would be to use pre-existing sequence models to predict the qualities of our generated sequences, i.e. enhancer activity, gene expression, and chromatin accessibility. How we get from these models to a metric is TBD. This issue would require applying pre-trained or retrained versions of these models to our dataset and, from that, developing a metric that assesses the quality of a regulatory sequence in terms of the previously mentioned properties.

@sg134 maybe we can schedule a meeting to talk a bit about this issue and distribute some tasks? I find that more productive than going back and forth via thread. If you're up for it, let me know your timezone and some date/times that would work for you.

sg134 commented on June 10, 2024

Hi @aaronwtr, would be great to further discuss and plan out the development of these models. Sent a DM on Discord regarding availability

jaewshin commented on June 10, 2024

Hi @aaronwtr @sg134, I'd also like to work on this! If you guys have already discussed the plans, could you let me know if/how I can contribute?

sg134 commented on June 10, 2024

Hi @jaewshin, before the weekly meeting, @aaronwtr and I discussed splitting up the work by task (i.e. I can work on enhancer activity prediction while Aaron works on gene expression prediction). However, Luca suggested focusing first on training models to predict chromatin accessibility. Do you have any suggestions for how we can split the work among the three of us? For example, we could each train a different type of model for chromatin accessibility and share results.

aaronwtr commented on June 10, 2024

Hi @jaewshin, I partly agree with @sg134 that we can all work in parallel on the same task, i.e. chromatin accessibility. However, I'm not sure it would be most productive for all of us to work on different architectures. I'm not familiar with chromatin accessibility prediction, but I reckon there is a model that currently achieves SOTA performance. I'd say we focus on that one in particular and divide the tasks across the different parts of the implementation. Personally, I'd also like to continue implementing Enformer, as I think it will be valuable to see whether we can generate regulatory sequences that interact with genes as we would expect, e.g. using actual regulatory sequences as validation. I think Enformer can definitely help in this respect.

Since @lucapinello indeed mentioned he'd like us to focus on the chromatin accessibility prediction, maybe he could elaborate a bit on his vision of this issue? What would be some of the objectives we should aim to achieve?

lucapinello commented on June 10, 2024

Hi guys, yes, it may be nice to focus first on just two models for chromatin accessibility prediction.

I propose training these two models:

We can start with this:

1) Deopen: https://github.com/kimmo1019/Deopen https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6192215/

For training, it is important to use the same matrix we are using to run the diffusion (160K sequences).

Next:

2) BPNet: https://github.com/kundajelab/bpnet For example, they trained it on ATAC here: https://www.sciencedirect.com/science/article/pii/S0092867421009429
Original paper: https://www.nature.com/articles/s41588-021-00782-6

BPNet predicts the signal intensity, so we need to recover bigWig files if we want to train this model. @meuleman, do you have bigWig files already available?

Happy to find a time to arrange a meeting to discuss this further.

Thanks!

meuleman commented on June 10, 2024

Yes bigwigs of normalized signal densities are readily available for all 733 biosamples

aaronwtr commented on June 10, 2024

Update: Just cloned Deopen into our repository. Also started working on a data preprocessing pipeline that takes in our DHS data and returns data in .bed format which is required for training Deopen. The preprocessing is currently functional for the positive samples. Code can be found here.

The next step is to generate background sequences that serve as negative training samples. We can use this method for this.

sg134 commented on June 10, 2024

@meuleman , how/where can we access the bigwig files?

aaronwtr commented on June 10, 2024

Update on the Deopen implementation: with the kind help of Manuel Tognon from @lucapinello's lab, the script for generating negatives should be working in Python. However, it seems that bedtools has issues interpreting our genomic coordinates. Specifically, it gives an out-of-bounds error, which suggests that the genomic locations from @meuleman's dataset do not align with whatever reference bedtools is using. The input for my run was:
python gen_null_seqs.py --bed positive.bed --genome hg38.fa --mask hg38.trf.bed --out negative.bed --outformat bed

The error message is as follows:

It says it is skipping the sequences that fall out of bounds but since no file is saved in the end, I assume that this is the case for all of the sequences. I know Manuel added the functionality to save as bed recently so it might not yet be working properly. I'll ask him about this, but in the meantime maybe any of you might know a workaround or what is going wrong here in the first place? @lucapinello @meuleman

lucapinello commented on June 10, 2024

I am wondering if this is a bug in the script @ManuelTgn created? @ManuelTgn can you please check with the inputs that @aaronwtr is using?

aaronwtr commented on June 10, 2024

@ManuelTgn This is the bed file I am trying to get negative sequences for: positive.txt

You only have to change the extension from .txt to .bed as github does not allow me to upload .bed files for some reason.

meuleman commented on June 10, 2024

Yeah those coordinates certainly don't come from the input data I provided -- looks like the coordinates shown here are indeed way off for most chromosomes. Perhaps related to the way the negative set is defined? Is it assuming all chromosomes are of equal length, perchance? Happy to chat about this and do a walk-through on a call if that would help.

-Wouter.

aaronwtr commented on June 10, 2024

The culprit was a bug in the code generating the positive.bed file. Thanks everyone for the help!

ManuelTgn commented on June 10, 2024

Hi everyone,
I extensively tested the script on different datasets and never ran into that error.
@aaronwtr I reproduced the error just by running BEDTools on the input genomic coordinates outside of the script. It looks like the coordinates are wrong. Could they have been mapped to a different genome version (e.g. hg19)? @aaronwtr are you using a script to recover these coordinates? If you need help with it, I can take a look and maybe figure out what is not working.
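
A quick way to narrow down this kind of problem is to check every BED record against the chromosome lengths of the target assembly before handing the file to bedtools. The sketch below is an assumption-laden illustration, not part of `gen_null_seqs.py`: the function name is hypothetical, and the chromosome sizes in the usage example are toy values. Real sizes could be taken, for example, from the first two columns of a FASTA index (`.fai`) for `hg38.fa`.

```python
def out_of_bounds(bed_lines, chrom_sizes):
    """Return (line_number, reason) for BED records that cannot be valid.

    BED intervals are 0-based and half-open, so end may equal the chromosome length.
    """
    problems = []
    for number, line in enumerate(bed_lines, start=1):
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines and BED header lines
        fields = line.rstrip("\n").split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if chrom not in chrom_sizes:
            problems.append((number, "unknown chromosome"))
        elif start < 0 or end <= start:
            problems.append((number, "bad interval"))
        elif end > chrom_sizes[chrom]:
            problems.append((number, "past chromosome end"))
    return problems


# Toy sizes and records, invented for illustration only.
toy_sizes = {"chr1": 1000, "chr2": 500}
toy_bed = ["chr1\t10\t20", "chr2\t400\t600", "chrZ\t0\t5", "chr1\t50\t40"]
report = out_of_bounds(toy_bed, toy_sizes)
```

If most records fail on "past chromosome end", a coordinate-generation bug or an assembly mismatch is more likely than a problem in the downstream tool.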

aaronwtr commented on June 10, 2024

To-do list Enformer implementation
High priority:

Genomic track ID implementation supplementary table 2 Enformer paper (Assignee(s): )
Enformer evaluation and sanity check (Assignee(s): @aaronwtr )

Low priority:

Refactor Google Colab such that it is runnable on GPU (Assignee(s): )
Add input type checker, should be Ensembl ID, or map gene name --> Ensembl ID (Assignee(s): )
Map gene from source genome --> hg38 if it has not been assembled with hg38 (Assignee(s): )
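
The input type checker listed under low priority could look roughly like the sketch below. It relies on the fact that human Ensembl gene IDs have the form `ENSG` followed by 11 digits, optionally with a `.version` suffix. The function names and the `name_to_id` lookup table are hypothetical; where that mapping would come from is not specified in this thread.

```python
import re

# Human Ensembl gene ID: "ENSG" + 11 digits, optional ".<version>" suffix.
ENSEMBL_GENE_ID = re.compile(r"^ENSG\d{11}(\.\d+)?$")


def is_ensembl_gene_id(value):
    """Return True if value looks like a human Ensembl gene ID."""
    return bool(ENSEMBL_GENE_ID.match(value.strip()))


def resolve_gene(value, name_to_id):
    """Return an unversioned Ensembl gene ID for an ID or a gene name.

    name_to_id is a caller-supplied dict from upper-case gene names to IDs.
    """
    value = value.strip()
    if is_ensembl_gene_id(value):
        return value.split(".")[0]  # drop the version suffix if present
    try:
        return name_to_id[value.upper()]
    except KeyError:
        raise ValueError(f"not an Ensembl gene ID or known gene name: {value!r}")


# Made-up mapping entry, for illustration only.
toy_mapping = {"GENE1": "ENSG00000000001"}
resolved = resolve_gene("gene1", toy_mapping)
```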

younwoochoi commented on June 10, 2024

To-do list Enformer implementation
High priority:

Genomic track ID implementation supplementary table 2 Enformer paper (Assignee(s): @younwoochoi)

Enformer evaluation and sanity check (Assignee(s): @aaronwtr, @younwoochoi)

Low priority:

Refactor Google Colab such that it is runnable on GPU (Assignee(s): )

Add input type checker, should be Ensembl ID, or map gene name --> Ensembl ID (Assignee(s): )

Map gene from source genome --> hg38 if it has not been assembled with hg38 (Assignee(s): )

I can help with the first two TODOs

kierandidi commented on June 10, 2024

To-do list Enformer implementation
High priority:

Genomic track ID implementation supplementary table 2 Enformer paper (Assignee(s): @kierandidi, @younwoochoi)

Enformer evaluation and sanity check (Assignee(s): @aaronwtr, @kierandidi, @younwoochoi)

Low priority:

Refactor Google Colab such that it is runnable on GPU (Assignee(s): )

Add input type checker, should be Ensembl ID, or map gene name --> Ensembl ID (Assignee(s): )

Map gene from source genome --> hg38 if it has not been assembled with hg38 (Assignee(s): )

I can also help with the first two TODOs

younwoochoi commented on June 10, 2024

Hi @aaronwtr, as @kierandidi suggested in the meeting, I wrote code that creates a data frame containing the output, assay_type, and target. I'm not sure where to push the code. Feel free to let me know if there is anything you want to edit or improve! @kierandidi

aaronwtr commented on June 10, 2024

Hi @younwoochoi, looking good! Thanks for this. One thing I'd change would be to get rid of the assay type in the targets. In the DeepMind table, they refer to the target as the combination of assay type and cell type. Since we already have the assay type in a separate column, having this in the target column as well seems a bit redundant. Hence, change the name of the target column from 'target' -> 'cell_type' and get rid of the assay types in the cell type column (you could use Python's str.split() for this).
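
A minimal sketch of that `str.split()` step, assuming the target strings have the form `ASSAY:description` (for example a `DNASE:` prefix followed by the cell type). The exact format of the table entries is an assumption here, and the function name is hypothetical.

```python
def split_target(target):
    """Split "ASSAY:cell type" into (assay_type, cell_type).

    Only the first colon is used, so any further colons stay in cell_type.
    If there is no colon, assay_type is None.
    """
    parts = target.split(":", 1)
    if len(parts) == 1:
        return None, parts[0].strip()
    return parts[0].strip(), parts[1].strip()


# Invented example rows, for illustration only.
rows = [{"target": "DNASE:cell line X"}, {"target": "CAGE:tissue Y"}]
cleaned = [
    dict(zip(("assay_type", "cell_type"), split_target(row["target"])))
    for row in rows
]
```

Entries with a second colon (for example a ChIP target that also names the histone mark or factor) would keep that extra field in `cell_type`, so they may need separate handling.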

With respect to pushing you could do two things:

Create a separate development branch for the enformer implementation in this repository ('enformer-implementation-dev-{YOUR NAME}') and prepare a pull request from your branch -> 'enformer-implementation' branch.
Fork 'enformer-implementation' and submit a pull request from your fork back into 'enformer-implementation'.

You can mark me as a reviewer for the PR, then I will make sure to merge everything correctly. See bf942b2 for how to submit a pull request in this repository.

Let me know if I can help with anything else!

kierandidi commented on June 10, 2024

Great work @younwoochoi! If it is alright I can edit your code in line with what @aaronwtr suggested and will make a PR today indicating us two as contributors.

younwoochoi commented on June 10, 2024

Sure! Thank you @aaronwtr @kierandidi .

kierandidi commented on June 10, 2024

Hey @aaronwtr @younwoochoi, I just created a PR from my fork of the repo after incorporating the suggestion of removing the assay information from the target column. Hope that the PR looks as it should do, let me know if there is anything to improve!

aaronwtr commented on June 10, 2024

That's great @kierandidi, thanks! I will have a look at it and merge it into the 'enformer-implementation' branch if there are no issues. I'll also make sure to update the codebase with the code you suggested to keep the codebase up-to-date with the latest changes.

younwoochoi commented on June 10, 2024

Thank you @kierandidi @aaronwtr . How can we help with the second TODO?

aaronwtr commented on June 10, 2024

To-do Enformer evaluation

Experimental data:

dataloader.EnformerDataloaderDNase Load DNase data into DataFrame (Assignee(s): @aaronwtr)
utils.sig_to_chr Map signal onto chr1 (Assignee(s): @aaronwtr)

Enformer:

dataloader.EnformerDataloaderDNase Get all genes on chr1 and collect TSS genomic coordinates (Assignee(s): @aaronwtr)
dataloader.EnformerDataloaderDNase Extend gene scope to 196,608 bp centered around the TSS. Note: you can use the pre-existing function from dataloader.EnformerDataloaderABC (Assignee(s): @aaronwtr)
dataloader.EnformerDataloaderDNase Get one-hot encoded sequence corresponding to genomic coordinates. Note: you can use the pre-existing functions from dataloader.EnformerDataloaderABC (Assignee(s): @aaronwtr)
inference.EnformerModel Run inference for the experimental DNase data (Assignee(s): @aaronwtr)

Evaluation:

eval.EnformerEvaluation Implement Poisson loss and PCC akin to the lucidrains implementation (Assignee(s): )
eval.EnformerEvaluation Run and assess comparison between experiment and Enformer (Assignee(s): @aaronwtr)
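
The one-hot encoding step in the list above can be illustrated with a short standalone function. This is not the pre-existing `EnformerDataloaderABC` code; the column order A, C, G, T and the all-zero row for any other character (such as N) are assumptions that should be checked against whatever the model wrapper expects.

```python
BASES = "ACGT"  # assumed column order


def one_hot(sequence):
    """Encode a DNA string as a list of 4-element rows, one row per base.

    Characters outside A, C, G, T (for example N) become an all-zero row.
    """
    encoded = []
    for base in sequence.upper():
        row = [0.0, 0.0, 0.0, 0.0]
        index = BASES.find(base)
        if index >= 0:
            row[index] = 1.0
        encoded.append(row)
    return encoded
```

For a full Enformer input, the sequence passed in would be the 196,608 bp window around the TSS, giving a 196,608 x 4 matrix.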

github-actions commented on June 10, 2024

This issue is stale because it has been open for 60 days with no activity.

github-actions commented on June 10, 2024

This issue was closed because it has been inactive for 7 days since being marked as stale.
