Comments (18)
I don't think you can do that with Bert. The masked LM loss is not a Language Modeling loss, it doesn't work nicely with the chain rule like the usual Language Modeling loss.
Please see the discussion on the TensorFlow repo on that here.
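To make the chain-rule point concrete, the two quantities can be sketched as follows. The notation is introduced here for illustration and is not from the thread: $w_i$ is the $i$-th token, $n$ is the sentence length, and PLL stands for "pseudo-log-likelihood".

```latex
% Left-to-right language model: the chain rule gives an exact factorization.
\log P(w_1, \dots, w_n) = \sum_{i=1}^{n} \log P(w_i \mid w_1, \dots, w_{i-1})

% Masked LM: every term conditions on both left and right context,
% so the sum is only a pseudo-log-likelihood, not \log P(w_1, \dots, w_n).
\mathrm{PLL}(w_1, \dots, w_n) = \sum_{i=1}^{n} \log P(w_i \mid w_1, \dots, w_{i-1}, w_{i+1}, \dots, w_n)
```

The second sum is generally not a proper sentence probability, although it may still be usable as a relative score between sentences.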
from transformers.
Hello @thomwolf, I can see it is possible to assign a score using BERT by masking each word sequentially and then scoring the sentence by summing the word scores. Here is how people were doing it for TensorFlow. I am trying to do the following:
import numpy as np
import torch
from pytorch_pretrained_bert import BertTokenizer,BertForMaskedLM
# Load pre-trained model (weights)
with torch.no_grad():
    model = BertForMaskedLM.from_pretrained('bert-large-cased')
    model.eval()
# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-large-cased')
def score(sentence):
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    sentence_loss=0.
    for i,word in enumerate(tokenize_input):
        tokenize_input[i]='[MASK]'
        mask_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
        word_loss=model(mask_input, masked_lm_labels=tensor_input).data.numpy()
        sentence_loss +=word_loss
        #print("Word: %s : %f"%(word, np.exp(-word_loss)))
    return np.exp(sentence_loss/len(tokenize_input))
score("There is a book on the table")
88.899999
Is this the right way to assign a score using BERT?
No, you masked each word but did not restore it afterwards, so the masks accumulate from one iteration to the next.
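The point about restoring the masked word can be illustrated without loading BERT. The sketch below is only an illustration: `toy_masked_log_prob` is a hypothetical stand-in for a masked LM (it just penalizes every extra `[MASK]` in the input), and `pseudo_log_likelihood` is a name made up here, not part of any library.

```python
import math


def toy_masked_log_prob(tokens, position, original):
    """Hypothetical stand-in for a masked LM score of `original` at `position`.

    It ignores the actual words and halves its confidence for every
    [MASK] token in the input, so accumulated masks show up as a penalty.
    """
    assert tokens[position] == "[MASK]"
    n_masked = sum(1 for t in tokens if t == "[MASK]")
    return math.log(0.5 ** n_masked)


def pseudo_log_likelihood(tokens, log_prob_fn, restore=True):
    """Mask one token at a time and sum the per-token log-probabilities."""
    tokens = list(tokens)  # work on a copy so the caller's list is untouched
    total = 0.0
    for i, original in enumerate(tokens):
        tokens[i] = "[MASK]"
        total += log_prob_fn(tokens, i, original)
        if restore:
            tokens[i] = original  # put the word back before masking the next one
    return total


sentence = "there is a book on the table".split()
with_restore = pseudo_log_likelihood(sentence, toy_masked_log_prob, restore=True)
without_restore = pseudo_log_likelihood(sentence, toy_masked_log_prob, restore=False)
```

With `restore=False` the last iteration sees a sentence made entirely of `[MASK]` tokens, which is what happens in the loop quoted above.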
It should be similar. The following code is for DistilBERT:
```
from torch.multiprocessing import TimeoutError, Pool, set_start_method, Queue
import torch.multiprocessing as mp
import math  # needed for math.exp below; missing from the original snippet
import torch
from transformers import DistilBertTokenizer, DistilBertForMaskedLM
from flask import Flask, request
import json

try:
    set_start_method('spawn')
except RuntimeError:
    pass

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def load_model():
    model = DistilBertForMaskedLM.from_pretrained('distilbert-base-uncased').to(device)
    model.eval()
    tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
    return tokenizer, model

tokenizer, model = load_model()

def score(sentence):
    # Return a large sentinel score for inputs that are too short or too long.
    if len(sentence.strip().split()) <= 1:
        return 10000
    tokenize_input = tokenizer.tokenize(sentence)
    if len(tokenize_input) > 512:
        return 10000
    input_ids = torch.tensor(tokenizer.encode(tokenize_input)).unsqueeze(0).to(device)
    with torch.no_grad():
        # Older transformers releases accept masked_lm_labels; newer ones renamed it to labels.
        loss = model(input_ids, masked_lm_labels=input_ids)[0]
    return math.exp(loss.item() / len(tokenize_input))
```
Hello @thomwolf, I can see it is possible to assign a score using BERT by masking each word sequentially and then scoring the sentence by summing the word scores. Here is how people were doing it for TensorFlow. I am trying to do the following:
import numpy as np
import torch
from pytorch_pretrained_bert import BertTokenizer,BertForMaskedLM
# Load pre-trained model (weights)
with torch.no_grad():
    model = BertForMaskedLM.from_pretrained('bert-large-cased')
    model.eval()
# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-large-cased')
def score(sentence):
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    sentence_loss=0.
    for i,word in enumerate(tokenize_input):
        tokenize_input[i]='[MASK]'
        mask_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
        word_loss=model(mask_input, masked_lm_labels=tensor_input).data.numpy()
        sentence_loss +=word_loss
        #print("Word: %s : %f"%(word, np.exp(-word_loss)))
    return np.exp(sentence_loss/len(tokenize_input))
score("There is a book on the table")
88.899999
Is this the right way to assign a score using BERT?
@orenschonlab Try below
import torch
import sys
import numpy as np
from transformers import GPT2Tokenizer, GPT2LMHeadModel
# Load pre-trained model (weights)
with torch.no_grad():
    model = GPT2LMHeadModel.from_pretrained('gpt2')
    model.eval()
# Load pre-trained model tokenizer (vocabulary)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
def score(sentence):
    tokenize_input = tokenizer.encode(sentence)
    tensor_input = torch.tensor([tokenize_input])
    loss=model(tensor_input, labels=tensor_input)[0]
    return np.exp(loss.detach().numpy())
if __name__=='__main__':
    for line in sys.stdin:
        if line.strip() !='':
            print(line.strip()+'\t'+ str(score(line.strip())))
        else:
            break
@mdasadul Did you manage to do it?
@mdasadul Do you mean this one?
https://twitter.com/mdasaduluofa/status/1181917072999231489/photo/1
I see this is for GPT-2. Do you have code for BERT?
@mdasadul I get the error:
TypeError: forward() got an unexpected keyword argument 'masked_lm_labels'
Also, can you please explain why the following steps are necessary:
- unsqueeze(0)
- add torch.no_grad()
- add model.eval()
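As a rough answer to the three steps asked about above, the sketch below shows what each one does on a tiny stand-in model. It assumes only that PyTorch is installed; `toy_model` is a hypothetical two-layer module, not DistilBERT, and the token ids are arbitrary example values.

```python
import torch
import torch.nn as nn

# 1) unsqueeze(0): models expect a batch dimension, i.e. shape (batch, seq_len).
ids = torch.tensor([101, 4302, 2018, 102])  # shape (4,)
batched = ids.unsqueeze(0)                  # shape (1, 4)

# 2) model.eval(): switches layers such as dropout to inference behaviour,
#    so repeated forward passes on the same input give the same output.
toy_model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))  # hypothetical toy model
toy_model.eval()
x = torch.ones(1, 4)

# 3) torch.no_grad(): disables gradient tracking, which saves memory and time
#    when the model is only being used for scoring.
with torch.no_grad():
    out_a = toy_model(x)
    out_b = toy_model(x)
```

None of the three changes the underlying scoring method; they make inference well-shaped, deterministic with respect to dropout, and cheaper.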
The score is equivalent to perplexity. Hence the lower the score, the better the sentence, right?
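For reference, perplexity is the exponential of the mean negative log-likelihood per token, so a lower value means the model found the sentence more predictable. Here is a minimal sketch of that arithmetic in plain Python. The `perplexity` helper and the per-token probabilities are made up for illustration and do not come from any model.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical per-token probabilities for two four-token sentences.
likely = [math.log(p) for p in (0.5, 0.5, 0.5, 0.5)]
unlikely = [math.log(p) for p in (0.1, 0.1, 0.1, 0.1)]

ppl_likely = perplexity(likely)      # each token has probability 0.5
ppl_unlikely = perplexity(unlikely)  # each token has probability 0.1
```

With a masked LM such as BERT the same formula yields what is usually called a pseudo-perplexity, so scores are best compared only between sentences scored by the same model.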
@mdasadul I get the error:
return math.exp(loss.item() / len(tokenize_input))
ValueError: only one element tensors can be converted to Python scalars
Any idea why?
@mdasadul I have a sentence with more than 1 word and still get the error
sentence is ' Harry had never believed he would'
input_ids is tensor([[ 101, 4302, 2018, 2196, 3373, 2002, 2052, 102]])
Below is an example from the official docs on how to implement GPT2 to determine perplexity.
https://huggingface.co/transformers/perplexity.html
@EricFillion But how can it be used for a sentence, not for a dataset?
Meaning I want the perplexity of the sentence:
Harry had never believed he would
@EricFillion But how can it be used for a sentence, not for a dataset?
Meaning I want the perplexity of the sentence:
Harry had never believed he would
I just played around with the code @mdasadul posted above. It works perfectly and is nice and concise. It outputted the same scores from the official documentation for short inputs.
If you're still interested in using the method from the official documentation, then you can replace "'\n\n'.join(test['text'])" with the text you wish to determine the perplexity of. You'll also want to add ".item()" to ppl to convert the tensor to a float.
This repo is quite useful. It supports Huggingface models.
https://github.com/awslabs/mlm-scoring