
tbepler / prose

92 stars · 3 watchers · 19 forks · 232 KB

Multi-task and masked language model-based protein sequence embedding models.

License: Other

Python 100.00%
protein-sequences language-model sequence-embedding protein-embedding deep-learning representation-learning

prose's People

Contributors

tbepler


prose's Issues

Controlling length of output sequences

Thank you for your great work! I was just wondering how I would modify the code to control the output length, as running the SkipLSTM model produces a Tensor of dimensions [386, 6165] (no pooling) when run on my pre-aligned sequences. I would like to produce much shorter representations for each of the 386 sequence components. Thank you.
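One way to shorten the per-position representations described above is to apply a linear projection to each row of the [386, 6165] output. The sketch below uses random stand-in weights purely for illustration; in the real model the trained projection layer's weights (the `proj` attribute, if that is indeed the final projection) would be used instead.

```python
import numpy as np

# Hypothetical stand-in for the SkipLSTM output: 386 positions x 6165 features.
rng = np.random.default_rng(0)
h = rng.standard_normal((386, 6165))

# A linear projection shrinks the feature dimension of every position.
# Random weights here; swap in the model's trained projection in practice.
W = rng.standard_normal((100, 6165)) / np.sqrt(6165)
z = h @ W.T  # shape (386, 100): one 100-d vector per sequence position
```

Pooling over axis 0 (e.g. `z.mean(axis=0)`) would further collapse the 386 positions into a single vector, if a fixed-size representation is wanted.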

Structure similarity confusion

Hello, sir!

I'm trying to use embeddings for distance calculation between protein sequences. In both papers (2019, 2021), you proposed the "soft sequence alignment" method.

I have several questions regarding the SSA:

  1. In the recent paper (2021), the size of the embeddings was reduced via linear projection from ~6K to 100. Here, by default, the model outputs the full embeddings. Do you think SSA would still apply to these "full" embeddings? I noticed that the SkipLSTM encoder has a proj attribute, which I assume is the linear projection matrix. Can I apply this layer to the embeddings to get the same representation used in the paper for distance calculation?
  2. While implementing SSA, I stumbled upon an ambiguity. The formulas for the alpha and beta parameters (a_ij and b_ij) have their normalizations summed over n and m sequence elements, respectively (sum_l^n (k_il), sum_l^m (k_lj)). However, i and j are the indices reserved for the elements of the first and second sequences, respectively. In the first sum (sum_l^n (k_il)), we fix i and sum over each l-th element of the second sequence, which has m elements in total (not n). In short, could you please clarify how to correctly compute the normalization constants in the alpha and beta parameters?
  3. In the paper, you used full protein sequences from UniProt. In my case, I use partial (domain) subsequences and calculate the distances between them. Do you think such distances would still be meaningful?
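One internally consistent reading of the normalization in question 2 (an assumption about the intended formula, not an authoritative answer) is that alpha normalizes over the m positions of the second sequence and beta over the n positions of the first. A minimal sketch under that reading:

```python
import numpy as np

def ssa_similarity(z1, z2):
    """Soft symmetric alignment similarity between two embedded sequences.

    z1: (n, d) per-residue embeddings of the first sequence
    z2: (m, d) per-residue embeddings of the second sequence
    Assumes alpha_ij normalizes over the m elements of the second sequence
    and beta_ij over the n elements of the first.
    """
    # k_ij = -||z1_i - z2_j||_1, pairwise negative L1 distances, shape (n, m)
    k = -np.abs(z1[:, None, :] - z2[None, :, :]).sum(axis=-1)
    # alpha: row-wise softmax (fix i, sum over the m elements of sequence 2)
    e = np.exp(k - k.max(axis=1, keepdims=True))
    alpha = e / e.sum(axis=1, keepdims=True)
    # beta: column-wise softmax (fix j, sum over the n elements of sequence 1)
    e = np.exp(k - k.max(axis=0, keepdims=True))
    beta = e / e.sum(axis=0, keepdims=True)
    a = alpha + beta - alpha * beta  # soft symmetric alignment weights
    # similarity: alignment-weighted average of k (always <= 0 here)
    return (a * k).sum() / a.sum()
```

With this axis convention the similarity of a sequence with itself is near zero (the alignment concentrates on the zero-distance diagonal), and it decreases as the sequences diverge.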

I realize that's a lot of abstract questions, so any insight you could give is highly appreciated!

Ivan

length limit for the pretrained model

Thanks for your great work! I plan to use it to generate embeddings for several downstream tasks. Is there any limit on the length of the input sequence?

About proteins embedding

Thanks for your great work! I only had time to read the paper roughly.
Your paper states: "Our encoder consists of 3 biLSTM layers with 512 hidden units each and a final output embedding dimension of 100."

Running your test command for the pretrained model:

python embed_sequences.py --pool avg -o data/demo.h5 data/demo.fa

the output dimension is 6165.
If I only want the 100-dimensional embedding, how do I get it?
And what does 6165 mean?
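One plausible decomposition of 6165 (an assumption from arithmetic alone, not confirmed by the repo docs) is the 21-dimensional one-hot input concatenated with the hidden states of three biLSTM layers at 1024 units per direction: 21 + 3 × 2 × 1024 = 6165. Under that assumption, and if `proj` is indeed the final 6165 → 100 linear layer, applying it to the pooled vector would recover the paper's 100-d embedding. A sketch with stand-in weights:

```python
import numpy as np

# Assumed layout of the 6165-d vector (arithmetic check only):
# 21 (one-hot input) + 3 layers * 2 directions * 1024 hidden = 6165
assert 21 + 3 * 2 * 1024 == 6165

rng = np.random.default_rng(1)
pooled = rng.standard_normal(6165)  # one avg-pooled 6165-d sequence vector

# Random stand-in for the trained projection; in the real model, use the
# weights of the `proj` layer if it maps 6165 -> 100.
W = rng.standard_normal((100, 6165)) / np.sqrt(6165)
embedding = W @ pooled  # hypothetical 100-d embedding
```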
