Dear Authors,
Thank you for the great work!
I was reviewing the code and noticed that the way you extract embeddings differs from what is typically done: inputs are padded to a fixed maximum length (32 tokens). Most embedding-extraction code I have seen does not do this; it simply tokenizes the inputs (padding only to the longest sequence in the batch, if at all) and passes them through the model.
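For concreteness, here is a minimal sketch of the two approaches I mean. The checkpoint name and the 32-token limit are placeholders standing in for whatever your code actually uses; I'm assuming a standard Hugging Face `transformers` setup with CLS pooling, not your exact pipeline:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Placeholder checkpoint; substitute the model actually used in this repo.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

texts = ["myocardial infarction", "heart attack"]  # arbitrary example inputs

# Approach A: pad every input to a fixed max length (as this repo appears to do).
fixed = tokenizer(texts, padding="max_length", truncation=True,
                  max_length=32, return_tensors="pt")

# Approach B: the more common dynamic padding, only to the longest item in the batch.
dynamic = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    emb_fixed = model(**fixed).last_hidden_state[:, 0]    # CLS embeddings
    emb_dynamic = model(**dynamic).last_hidden_state[:, 0]
```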
I experimented with different maximum token lengths, both with and without the additional padding. The resulting cosine similarity scores between embeddings differ significantly depending on whether padding is used and how much is applied (i.e., what the maximum token length is set to).
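This is roughly how I measured it, reusing the `tokenizer` and `model` from the sketch above; the helper and the example texts are my own, not from your code:

```python
import torch
import torch.nn.functional as F

def embed(texts, max_length=None):
    """CLS embeddings with either fixed-length or dynamic padding."""
    kwargs = dict(truncation=True, return_tensors="pt")
    if max_length is not None:
        kwargs.update(padding="max_length", max_length=max_length)
    else:
        kwargs["padding"] = True  # dynamic padding to the longest sequence
    batch = tokenizer(texts, **kwargs)
    with torch.no_grad():
        return model(**batch).last_hidden_state[:, 0]

# Compare the same pair of terms under dynamic padding and several fixed lengths.
for max_len in (None, 16, 32, 64):
    a, b = embed(["myocardial infarction", "heart attack"], max_len)
    sim = F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0)).item()
    print(f"max_length={max_len}: cosine similarity = {sim:.4f}")
```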
I re-read your CODER papers and found nothing about padding, nor could I find anything more in this repo. Can you explain why you chose this padding strategy? Have you experimented with removing or adjusting the padding, and with its ultimate impact on cosine similarity between embeddings and on overall performance?