
ankh's People

Contributors

agemagician, ddofer, eltociear, hazemessamm, lhallee, sarahbadawy, wafaaashraf, wmustafaawad


ankh's Issues

Generating sequences longer than 115

Hello!

I see that this implementation of T5 has ~115 extra ids for masking. Is it possible to tile or pattern these in some way to get the model to generate sequences longer than 115, or do new extra ids need to be added for this purpose? I didn't know whether anyone had tried this yet.
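Not a definitive answer, but a quick way to inspect how many sentinel (mask) tokens the tokenizer actually exposes; this is a minimal sketch that reuses the ankh.load_large_model() call from the README:

  import ankh

  # Loads both the encoder and the tokenizer; only the tokenizer is needed here.
  model, tokenizer = ankh.load_large_model()

  # In T5-style tokenizers the sentinel/mask tokens are typically registered
  # as additional special tokens.
  sentinels = tokenizer.additional_special_tokens
  print(len(sentinels), sentinels[:3])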

Half-precision

Will performance be affected if half precision is used with Ankh, i.e., calling model.half() to convert the weights to FP16?
I am trying to optimize inference speed as much as possible.
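For reference, a minimal sketch of FP16 inference; the ankh.load_large_model() call is the one from the README, everything else is standard PyTorch, and the example sequence is made up:

  import torch
  import ankh

  model, tokenizer = ankh.load_large_model()
  model = model.half().eval()                        # cast weights to FP16
  if torch.cuda.is_available():
      model = model.cuda()

  seq = list("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")     # hypothetical sequence
  inputs = tokenizer.batch_encode_plus([seq],
                                       add_special_tokens=True,
                                       is_split_into_words=True,
                                       return_tensors="pt")
  inputs = {k: v.to(model.device) for k, v in inputs.items()}

  with torch.no_grad():
      out = model(**inputs)
  embeddings = out.last_hidden_state.float()         # cast back up if needed

Whether the small numerical differences introduced by FP16 matter is best checked on your own downstream task.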

Is feature extraction correct given padding?

Thank you for this great protein LM. My question concerns feature extraction when padding is used (given that the input sequences have different lengths). In this case, shouldn't the attention_mask be passed as an additional input to the model, so that padding tokens are ignored and the results are correct? Like this:

  import torch
  import ankh

  model, tokenizer = ankh.load_large_model()
  model.eval()

  protein_sequences = ['MKALCLLLLPVLGLLVSSKTLCSMEEAINERIQEVAGSLIFRAISSIGLECQSVTSRGDLATCPRGFAVTGCTCGSACGSWDVRAETTCHCQCAGMDWTGARCCRVQPLEHHHHHH',
                       'GSHMSLFDFFKNKGSAATATDRLKLILAKERTLNLPYMEEMRKEIIAVIQKYTKSSDIHFKTLDSNQSVETIEVEIILPR']

  # The tokenizer expects each sequence as a list of residues.
  protein_sequences = [list(seq) for seq in protein_sequences]

  encoded = tokenizer.batch_encode_plus(protein_sequences,
                                        add_special_tokens=True,
                                        padding=True,
                                        is_split_into_words=True,
                                        return_tensors="pt")

  with torch.no_grad():
      # Passing attention_mask makes the encoder ignore padding positions.
      outputs = model(input_ids=encoded['input_ids'],
                      attention_mask=encoded['attention_mask'])
      embeddings = outputs.last_hidden_state

Thank you.

Availability of fine-tuned models for downstream tasks

Hi there, great work on making these large pre-trained protein LMs available for us to use.

Are the fine-tuned models for prediction of secondary structure, contacts, solubility, fluorescence, etc. going to be made available?

Thanks!

Python script for solubility score?

Hi,

Reading through the different documents in this repo, it's not clear to me how to run the solubility prediction to get a solubility score for a given protein.

Is there such a script? Thanks.

Availability of training/fine-tuning scripts

Hi! Are there any publicly available training/fine-tuning scripts? I would like to fine-tune the pre-trained Ankh on a custom dataset of protein sequences using the masked setup described in section 4.6.1 of the manuscript. It would be wonderful if I could utilize your code, but unfortunately, I couldn't find it anywhere. Thank you!
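Not an official script, but a minimal sketch of how T5-style masked (span-denoising) fine-tuning is usually set up with transformers. The Hub id ElnaggarLab/ankh-base, the use of the first additional special token as the sentinel, and the single-step training loop are all assumptions; the exact masking scheme from section 4.6.1 of the paper is not reproduced here.

  import torch
  from transformers import AutoTokenizer, T5ForConditionalGeneration

  model_id = "ElnaggarLab/ankh-base"                  # assumed Hub id
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = T5ForConditionalGeneration.from_pretrained(model_id)

  sentinel = tokenizer.additional_special_tokens[0]   # assumed mask/sentinel token

  def mask_span(seq, start, length):
      """Replace one span with the sentinel; the target reconstructs that span."""
      src = list(seq[:start]) + [sentinel] + list(seq[start + length:])
      tgt = [sentinel] + list(seq[start:start + length])
      return src, tgt

  src, tgt = mask_span("MKALCLLLLPVLGLLVSSKTLCSMEEAINERI", start=10, length=5)

  inputs = tokenizer(src, is_split_into_words=True, return_tensors="pt")
  labels = tokenizer(tgt, is_split_into_words=True, return_tensors="pt").input_ids

  optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
  model.train()
  loss = model(**inputs, labels=labels).loss          # denoising cross-entropy
  loss.backward()
  optimizer.step()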

MultiLabelClassification Issue

Hello @agemagician @hazemessamm ,

In the current multi-label ConvBERT, the output logits have shape (batch_size, seq_len, emb_dim), while the labels have shape (batch_size, emb_dim). F.binary_cross_entropy_with_logits requires the predictions and labels to be the same size, so this will not work if I understand it correctly. One potential solution is to average across the sequence dimension to obtain (batch_size, emb_dim), but I'm not sure whether that is optimal. Does it work as intended, or have I actually found a bug?
Best,
Logan
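For what it's worth, a minimal sketch of the averaging workaround described above: mean-pool the per-residue logits over the sequence dimension (respecting the attention mask) before calling F.binary_cross_entropy_with_logits. All shapes and names here are illustrative, not taken from the ankh code.

  import torch
  import torch.nn.functional as F

  batch_size, seq_len, num_labels = 4, 128, 10             # num_labels = "emb_dim" above
  logits = torch.randn(batch_size, seq_len, num_labels)    # per-residue logits
  labels = torch.randint(0, 2, (batch_size, num_labels)).float()
  mask = torch.ones(batch_size, seq_len)                   # 1 = residue, 0 = padding

  # Masked mean over the sequence dimension -> one logit vector per sequence.
  pooled = (logits * mask.unsqueeze(-1)).sum(dim=1) / mask.sum(dim=1, keepdim=True)

  loss = F.binary_cross_entropy_with_logits(pooled, labels)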

Missing model architecture and training details

Hello,

I have just read the paper, Great work!

I hoped to get more insight into the model architecture and training details, but none seem to be available in the repo.

It is a bit intriguing that a paper focused on so many pre-training experiments does not include the model architecture and training code.

Secondary structure prediction dataset

Hi!

When running examples/secondary_structure_prediction_3_states.ipynb, I get an error while downloading the dataset from Hugging Face. It looks like the files referenced in the script do not exist.

  FileNotFoundError: Couldn't find a dataset script at /home/u/Ankh/examples/proteinea/SSP/SSP.py or any data file in the same directory. Couldn't find 'proteinea/SSP' on the Hugging Face Hub either: FileNotFoundError: Unable to find training_data.csv in dataset repository proteinea/secondary_structure_prediction with any supported extension ['csv', ...]

Also, is there a straightforward way to use Ankh for secondary structure prediction on a single protein sequence?

Database of generated embeddings

Hi, thanks so much for your work!

I was wondering: have you generated embeddings with Ankh for UniRef50 or another database and made them available somewhere, by any chance? It would be awesome and time-saving! Either in float16 or float64 (float16 would be easier to store, and I think it performs as well as float64 on downstream tasks?).

Thanks
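As a side note on storage, a minimal sketch of saving embeddings in float16 and casting them back up when loading; this is plain PyTorch/NumPy, and the shapes are placeholders:

  import numpy as np
  import torch

  embeddings = torch.randn(1024, 1536)                # placeholder for real Ankh embeddings
  np.save("embeddings_fp16.npy", embeddings.to(torch.float16).cpu().numpy())

  # Load and cast back to float32 for downstream use.
  emb = torch.from_numpy(np.load("embeddings_fp16.npy")).float()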

Generation within Sequence/Mask prediction

Thank you, authors, for your publication and for open-sourcing this work. :)
I was able to run the simple_generation example you provided; this is great if one wants to generate at the end of a sequence.
Is it possible to generate within the middle of a sequence, using the existing model and mask token as well?
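A minimal sketch of T5-style infilling with a sentinel token placed in the middle of the sequence. The Hub id ElnaggarLab/ankh-large and the choice of the first additional special token as the mask are assumptions, and whether the decoder was trained to fill such spans is exactly the question above, so treat this as an experiment rather than a supported workflow:

  import torch
  from transformers import AutoTokenizer, T5ForConditionalGeneration

  model_id = "ElnaggarLab/ankh-large"                 # assumed Hub id
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = T5ForConditionalGeneration.from_pretrained(model_id).eval()

  seq = "MKALCLLLLPVLGLLVSSKTLCSMEEAINERI"
  sentinel = tokenizer.additional_special_tokens[0]   # assumed mask/sentinel token

  # Mask residues 10..14 and ask the decoder to fill the gap.
  src = list(seq[:10]) + [sentinel] + list(seq[15:])
  inputs = tokenizer(src, is_split_into_words=True, return_tensors="pt")

  with torch.no_grad():
      generated = model.generate(**inputs, max_new_tokens=20)
  print(tokenizer.decode(generated[0], skip_special_tokens=False))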

High-N generation

Dear authors,
Thank you for open-sourcing your amazing PLM.
Could you please provide an example of how to fine-tune the model for sequence generation conditioned on a protein family?
Is that the same as fine-tuning for causal language modeling with the transformers library?

Thank you.
Ai
