
ankh's People

Contributors

agemagician, ddofer, eltociear, hazemessamm, lhallee, sarahbadawy, wafaaashraf, wmustafaawad


ankh's Issues

Generating sequences longer than 115

Hello!

I see that this implementation of T5 has ~115 extra ids for masking. Is it possible to tile or pattern these in some way to get the model to generate sequences longer than 115, or do new extra ids need to be added for this purpose? I didn't know whether anyone had tried this yet.
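Not a definitive answer, but a quick way to inspect how many sentinel (mask) tokens the tokenizer actually exposes; this is a minimal sketch that reuses the ankh.load_large_model() call from the README:

  import ankh

  # Loads both the encoder and the tokenizer; only the tokenizer is needed here.
  model, tokenizer = ankh.load_large_model()

  # In T5-style tokenizers the sentinel/mask tokens are typically registered
  # as additional special tokens.
  sentinels = tokenizer.additional_special_tokens
  print(len(sentinels), sentinels[:3])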

Half-precision

Will performance be affected if half precision is used with Ankh, i.e., calling model.half() to convert the weights to FP16?
I am trying to optimize inference speed as much as possible.
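For reference, a minimal sketch of FP16 inference; the ankh.load_large_model() call is the one from the README, everything else is standard PyTorch, and the example sequence is made up:

  import torch
  import ankh

  model, tokenizer = ankh.load_large_model()
  model = model.half().eval()                        # cast weights to FP16
  if torch.cuda.is_available():
      model = model.cuda()

  seq = list("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")     # hypothetical sequence
  inputs = tokenizer.batch_encode_plus([seq],
                                       add_special_tokens=True,
                                       is_split_into_words=True,
                                       return_tensors="pt")
  inputs = {k: v.to(model.device) for k, v in inputs.items()}

  with torch.no_grad():
      out = model(**inputs)
  embeddings = out.last_hidden_state.float()         # cast back up if needed

Whether the small numerical differences introduced by FP16 matter is best checked on your own downstream task.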

Is feature extraction correct given padding?

Thank you for this great protein LM. My question concerns feature extraction when padding is used (given that the input sequences have different lengths). In this case, shouldn't the attention_mask be passed as an additional input to the model, so that padding tokens are ignored and the results are correct? Like this:

  import torch
  import ankh

  model, tokenizer = ankh.load_large_model()
  model.eval()

  protein_sequences = ['MKALCLLLLPVLGLLVSSKTLCSMEEAINERIQEVAGSLIFRAISSIGLECQSVTSRGDLATCPRGFAVTGCTCGSACGSWDVRAETTCHCQCAGMDWTGARCCRVQPLEHHHHHH',
                       'GSHMSLFDFFKNKGSAATATDRLKLILAKERTLNLPYMEEMRKEIIAVIQKYTKSSDIHFKTLDSNQSVETIEVEIILPR']

  # The tokenizer expects each sequence as a list of residues.
  protein_sequences = [list(seq) for seq in protein_sequences]

  encoded = tokenizer.batch_encode_plus(protein_sequences,
                                        add_special_tokens=True,
                                        padding=True,
                                        is_split_into_words=True,
                                        return_tensors="pt")

  with torch.no_grad():
      # Passing attention_mask makes the encoder ignore padding positions.
      outputs = model(input_ids=encoded['input_ids'],
                      attention_mask=encoded['attention_mask'])
      embeddings = outputs.last_hidden_state

Thank you.

Availability of fine-tuned models for downstream tasks

Hi there, great work on making these large pre-trained protein LMs available for us to use.

Are the fine-tuned models for prediction of secondary structure, contacts, solubility, fluorescence, etc. going to be made available?

Thanks!

Python script for solubility score?

Hi,

Reading through the different documents in this repo, it's not clear to me how to run the solubility prediction to get a solubility score for a given protein.

Is there such a script? Thanks.

Availability of training/fine-tuning scripts

Hi! Are there any publicly available training/fine-tuning scripts? I would like to fine-tune the pre-trained Ankh on a custom dataset of protein sequences using the masked setup described in section 4.6.1 of the manuscript. It would be wonderful if I could utilize your code, but unfortunately, I couldn't find it anywhere. Thank you!
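Not an official script, but a minimal sketch of how T5-style masked (span-denoising) fine-tuning is usually set up with transformers. The Hub id ElnaggarLab/ankh-base, the use of the first additional special token as the sentinel, and the single-step training loop are all assumptions; the exact masking scheme from section 4.6.1 of the paper is not reproduced here.

  import torch
  from transformers import AutoTokenizer, T5ForConditionalGeneration

  model_id = "ElnaggarLab/ankh-base"                  # assumed Hub id
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = T5ForConditionalGeneration.from_pretrained(model_id)

  sentinel = tokenizer.additional_special_tokens[0]   # assumed mask/sentinel token

  def mask_span(seq, start, length):
      """Replace one span with the sentinel; the target reconstructs that span."""
      src = list(seq[:start]) + [sentinel] + list(seq[start + length:])
      tgt = [sentinel] + list(seq[start:start + length])
      return src, tgt

  src, tgt = mask_span("MKALCLLLLPVLGLLVSSKTLCSMEEAINERI", start=10, length=5)

  inputs = tokenizer(src, is_split_into_words=True, return_tensors="pt")
  labels = tokenizer(tgt, is_split_into_words=True, return_tensors="pt").input_ids

  optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
  model.train()
  loss = model(**inputs, labels=labels).loss          # denoising cross-entropy
  loss.backward()
  optimizer.step()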

MultiLabelClassification Issue

Hello @agemagician @hazemessamm ,

In the current multi-label ConvBERT, the output logits have shape (batch_size, seq_len, emb_dim), while the labels have shape (batch_size, emb_dim). F.binary_cross_entropy_with_logits requires the predictions and labels to be the same size, so this will not work if I understand it correctly. One potential solution is to average across the sequence dimension to obtain (batch_size, emb_dim), but I'm not sure whether that is optimal. Does it work as intended, or have I actually found a bug?
Best,
Logan
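For what it's worth, a minimal sketch of the averaging workaround described above: mean-pool the per-residue logits over the sequence dimension (respecting the attention mask) before calling F.binary_cross_entropy_with_logits. All shapes and names here are illustrative, not taken from the ankh code.

  import torch
  import torch.nn.functional as F

  batch_size, seq_len, num_labels = 4, 128, 10             # num_labels = "emb_dim" above
  logits = torch.randn(batch_size, seq_len, num_labels)    # per-residue logits
  labels = torch.randint(0, 2, (batch_size, num_labels)).float()
  mask = torch.ones(batch_size, seq_len)                   # 1 = residue, 0 = padding

  # Masked mean over the sequence dimension -> one logit vector per sequence.
  pooled = (logits * mask.unsqueeze(-1)).sum(dim=1) / mask.sum(dim=1, keepdim=True)

  loss = F.binary_cross_entropy_with_logits(pooled, labels)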

Missing model architecture and training details

Hello,

I have just read the paper, Great work!

I hoped to get more insight into the model architecture and training details, but none seem to be available in the repo.

It is a bit intriguing that a paper focused on so many pre-training experiments does not include the model architecture and training code.

Secondary structure prediction dataset

Hi!

When running examples/secondary_structure_prediction_3_states.ipynb, I get an error while downloading the dataset from Hugging Face. It looks like the files referenced in the script do not exist.

  FileNotFoundError: Couldn't find a dataset script at /home/u/Ankh/examples/proteinea/SSP/SSP.py or any data file in the same directory. Couldn't find 'proteinea/SSP' on the Hugging Face Hub either: FileNotFoundError: Unable to find training_data.csv in dataset repository proteinea/secondary_structure_prediction with any supported extension ['csv', ...]

Also, is there a straightforward way to use Ankh for secondary structure prediction on a single protein sequence?

Database of generated embeddings

Hi, thanks so much for your work!

I was wondering: have you generated embeddings with Ankh for UniRef50 or another database and made them available somewhere, by any chance? It would be awesome and time-saving! Either in float16 or float64 (float16 would be easier to store, and I think it performs as well as float64 on downstream tasks?).

Thanks
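As a side note on storage, a minimal sketch of saving embeddings in float16 and casting them back up when loading; this is plain PyTorch/NumPy, and the shapes are placeholders:

  import numpy as np
  import torch

  embeddings = torch.randn(1024, 1536)                # placeholder for real Ankh embeddings
  np.save("embeddings_fp16.npy", embeddings.to(torch.float16).cpu().numpy())

  # Load and cast back to float32 for downstream use.
  emb = torch.from_numpy(np.load("embeddings_fp16.npy")).float()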

Generation within Sequence/Mask prediction

Thank you, authors, for your publication and for open-sourcing this work. :)
I was able to run the simple_generation example you provided; this is great if one wants to generate at the end of a sequence.
Is it possible to generate within the middle of a sequence, using the existing model and mask token as well?
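A minimal sketch of T5-style infilling with a sentinel token placed in the middle of the sequence. The Hub id ElnaggarLab/ankh-large and the choice of the first additional special token as the mask are assumptions, and whether the decoder was trained to fill such spans is exactly the question above, so treat this as an experiment rather than a supported workflow:

  import torch
  from transformers import AutoTokenizer, T5ForConditionalGeneration

  model_id = "ElnaggarLab/ankh-large"                 # assumed Hub id
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = T5ForConditionalGeneration.from_pretrained(model_id).eval()

  seq = "MKALCLLLLPVLGLLVSSKTLCSMEEAINERI"
  sentinel = tokenizer.additional_special_tokens[0]   # assumed mask/sentinel token

  # Mask residues 10..14 and ask the decoder to fill the gap.
  src = list(seq[:10]) + [sentinel] + list(seq[15:])
  inputs = tokenizer(src, is_split_into_words=True, return_tensors="pt")

  with torch.no_grad():
      generated = model.generate(**inputs, max_new_tokens=20)
  print(tokenizer.decode(generated[0], skip_special_tokens=False))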

High-N generation

Dear authors,
Thank you for open-sourcing your amazing PLM.
Could you please provide an example of how to fine-tune the model for sequence generation conditioned on a protein family?
Is that the same as fine-tuning for causal language modeling with the transformers library?

Thank you.
Ai
