
tbepler / prose

92 stars · 3 watchers · 19 forks · 232 KB

Multi-task and masked language model-based protein sequence embedding models.

License: Other

Python 100.00%
protein-sequences language-model sequence-embedding protein-embedding deep-learning representation-learning

prose's People

Contributors

tbepler


prose's Issues

Controlling length of output sequences

Thank you for your great work! I was just wondering how I would modify the code to control the output length, as running the SkipLSTM model produces a Tensor of dimensions [386, 6165] (no pooling) when run on my pre-aligned sequences. I would like to produce much shorter representations for each of the 386 sequence components. Thank you.
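One way to shorten the per-position representations described above is to apply a linear projection to each row of the [386, 6165] output. The sketch below uses random stand-in weights purely for illustration; in the real model the trained projection layer's weights (the `proj` attribute, if that is indeed the final projection) would be used instead.

```python
import numpy as np

# Hypothetical stand-in for the SkipLSTM output: 386 positions x 6165 features.
rng = np.random.default_rng(0)
h = rng.standard_normal((386, 6165))

# A linear projection shrinks the feature dimension of every position.
# Random weights here; swap in the model's trained projection in practice.
W = rng.standard_normal((100, 6165)) / np.sqrt(6165)
z = h @ W.T  # shape (386, 100): one 100-d vector per sequence position
```

Pooling over axis 0 (e.g. `z.mean(axis=0)`) would further collapse the 386 positions into a single vector, if a fixed-size representation is wanted.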

Structure similarity confusion

Hello, sir!

I'm trying to use embeddings for distance calculation between protein sequences. In both papers (2019, 2021), you proposed the "soft sequence alignment" method.

I have several questions regarding the SSA:

  1. In the recent paper (2021), the size of the embeddings was reduced via linear projection from ~6K to 100. Here, by default, the model outputs the full embeddings. Do you think SSA would still apply to these "full" embeddings? I noticed that the SkipLSTM encoder has a proj attribute, which I assume is the linear projection matrix. Can I apply this layer to the embeddings to get the same representation used in the paper for distance calculation?
  2. While implementing SSA, I stumbled upon an ambiguity. The formulas for the alpha and beta parameters (a_ij and b_ij) have their normalizations summed over n and m sequence elements, respectively (sum_l^n (k_il), sum_l^m (k_lj)). However, i and j are the indices reserved for the elements of the first and second sequences, respectively. In the first sum (sum_l^n (k_il)), we fix i and sum over each l-th element of the second sequence, which has m elements in total (not n). In short, could you please clarify how to correctly compute the normalization constants in the alpha and beta parameters?
  3. In the paper, you used full protein sequences from UniProt. In my case, I use partial (domain) subsequences and calculate the distances between them. Do you think such distances would still be meaningful?
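One internally consistent reading of the normalization in question 2 (an assumption about the intended formula, not an authoritative answer) is that alpha normalizes over the m positions of the second sequence and beta over the n positions of the first. A minimal sketch under that reading:

```python
import numpy as np

def ssa_similarity(z1, z2):
    """Soft symmetric alignment similarity between two embedded sequences.

    z1: (n, d) per-residue embeddings of the first sequence
    z2: (m, d) per-residue embeddings of the second sequence
    Assumes alpha_ij normalizes over the m elements of the second sequence
    and beta_ij over the n elements of the first.
    """
    # k_ij = -||z1_i - z2_j||_1, pairwise negative L1 distances, shape (n, m)
    k = -np.abs(z1[:, None, :] - z2[None, :, :]).sum(axis=-1)
    # alpha: row-wise softmax (fix i, sum over the m elements of sequence 2)
    e = np.exp(k - k.max(axis=1, keepdims=True))
    alpha = e / e.sum(axis=1, keepdims=True)
    # beta: column-wise softmax (fix j, sum over the n elements of sequence 1)
    e = np.exp(k - k.max(axis=0, keepdims=True))
    beta = e / e.sum(axis=0, keepdims=True)
    a = alpha + beta - alpha * beta  # soft symmetric alignment weights
    # similarity: alignment-weighted average of k (always <= 0 here)
    return (a * k).sum() / a.sum()
```

With this axis convention the similarity of a sequence with itself is near zero (the alignment concentrates on the zero-distance diagonal), and it decreases as the sequences diverge.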

I realize that's a lot of abstract questions, so any insight you could give is highly appreciated!

Ivan

length limit for the pretrained model

Thanks for your great work! I plan to use it to generate embeddings for several downstream tasks. Is there any limit on the length of the input sequence?

About proteins embedding

Thanks for your great work! I only had time to read the paper roughly.
Your paper states: "Our encoder consists of 3 biLSTM layers with 512 hidden units each and a final output embedding dimension of 100."

Running your test command for the pretrained model:

python embed_sequences.py --pool avg -o data/demo.h5 data/demo.fa

the output dimension is 6165.
If I only want the 100-dimensional embedding, how do I get it?
And what does 6165 mean?
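One plausible decomposition of 6165 (an assumption from arithmetic alone, not confirmed by the repo docs) is the 21-dimensional one-hot input concatenated with the hidden states of three biLSTM layers at 1024 units per direction: 21 + 3 × 2 × 1024 = 6165. Under that assumption, and if `proj` is indeed the final 6165 → 100 linear layer, applying it to the pooled vector would recover the paper's 100-d embedding. A sketch with stand-in weights:

```python
import numpy as np

# Assumed layout of the 6165-d vector (arithmetic check only):
# 21 (one-hot input) + 3 layers * 2 directions * 1024 hidden = 6165
assert 21 + 3 * 2 * 1024 == 6165

rng = np.random.default_rng(1)
pooled = rng.standard_normal(6165)  # one avg-pooled 6165-d sequence vector

# Random stand-in for the trained projection; in the real model, use the
# weights of the `proj` layer if it maps 6165 -> 100.
W = rng.standard_normal((100, 6165)) / np.sqrt(6165)
embedding = W @ pooled  # hypothetical 100-d embedding
```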
