
Comments

thomwolf avatar thomwolf commented on May 1, 2024 7

I don't think you can do that with BERT. The masked LM loss is not a language modeling loss; it doesn't factorize with the chain rule the way the usual language modeling loss does.
Please see the discussion on the TensorFlow repo on that here.
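
In symbols (a sketch of the distinction): a causal LM factorizes the sentence probability exactly via the chain rule, while summing BERT's per-word masked losses only yields a pseudo-log-likelihood, because every term conditions on both left and right context:

\log p(x) = \sum_{t=1}^{T} \log p(x_t \mid x_{<t}) \quad \text{(causal LM, exact)}

\mathrm{PLL}(x) = \sum_{t=1}^{T} \log p(x_t \mid x_{\setminus t}) \quad \text{(masked LM, not a true log-probability)}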

zhangyichang avatar zhangyichang commented on May 1, 2024 4

Hello @thomwolf, I can see that it is possible to assign a score using BERT, by masking each word sequentially and then scoring the sentence as the sum of the per-word scores. Here is how people were doing it for TensorFlow. I am trying the following:

import numpy as np
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM

# Load pre-trained model (weights)
with torch.no_grad():
    model = BertForMaskedLM.from_pretrained('bert-large-cased')
    model.eval()
    # Load pre-trained model tokenizer (vocabulary)
    tokenizer = BertTokenizer.from_pretrained('bert-large-cased')

def score(sentence):
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    sentence_loss = 0.
    for i, word in enumerate(tokenize_input):
        tokenize_input[i] = '[MASK]'
        mask_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
        word_loss = model(mask_input, masked_lm_labels=tensor_input).data.numpy()
        sentence_loss += word_loss
        # print("Word: %s : %f" % (word, np.exp(-word_loss)))
    return np.exp(sentence_loss / len(tokenize_input))

score("There is a book on the table")
88.899999

Is this the right way to assign a score using BERT?

No, you mask each word but never restore it, so after the first iteration the input contains more than one [MASK] token. See the corrected sketch below.
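
For reference, a minimal corrected sketch of the loop above that restores each token after scoring it (same pytorch_pretrained_bert API; the masked_lm_labels handling is kept exactly as in the original snippet):

import numpy as np
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM

model = BertForMaskedLM.from_pretrained('bert-large-cased')
model.eval()
tokenizer = BertTokenizer.from_pretrained('bert-large-cased')

def score(sentence):
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    sentence_loss = 0.
    for i in range(len(tokenize_input)):
        original_token = tokenize_input[i]
        tokenize_input[i] = '[MASK]'  # mask only the current position
        mask_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
        with torch.no_grad():
            word_loss = model(mask_input, masked_lm_labels=tensor_input).item()
        sentence_loss += word_loss
        tokenize_input[i] = original_token  # restore before masking the next position
    return np.exp(sentence_loss / len(tokenize_input))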

mdasadul avatar mdasadul commented on May 1, 2024 2

It should be similar. The following code is for DistilBERT:

import math

import torch
from transformers import DistilBertTokenizer, DistilBertForMaskedLM

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def load_model():
    model = DistilBertForMaskedLM.from_pretrained('distilbert-base-uncased').to(device)
    model.eval()
    tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
    return tokenizer, model

tokenizer, model = load_model()

def score(sentence):
    # sentinel value for degenerate inputs: single words and over-long sequences
    if len(sentence.strip().split()) <= 1:
        return 10000
    tokenize_input = tokenizer.tokenize(sentence)
    if len(tokenize_input) > 512:
        return 10000
    input_ids = torch.tensor(tokenizer.encode(tokenize_input)).unsqueeze(0).to(device)
    with torch.no_grad():
        loss = model(input_ids, masked_lm_labels=input_ids)[0]
    return math.exp(loss.item() / len(tokenize_input))
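
A quick usage sketch of the function above; lower scores mean lower pseudo-perplexity, i.e. sentences the model finds more fluent:

print(score("There is a book on the table"))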

mdasadul avatar mdasadul commented on May 1, 2024 1

@orenschonlab Try the code below:

import sys

import numpy as np
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load pre-trained model (weights)
with torch.no_grad():
    model = GPT2LMHeadModel.from_pretrained('gpt2')
    model.eval()
# Load pre-trained model tokenizer (vocabulary)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

def score(sentence):
    tokenize_input = tokenizer.encode(sentence)
    tensor_input = torch.tensor([tokenize_input])
    loss = model(tensor_input, labels=tensor_input)[0]
    return np.exp(loss.detach().numpy())

if __name__ == '__main__':
    for line in sys.stdin:
        if line.strip() != '':
            print(line.strip() + '\t' + str(score(line.strip())))
        else:
            break

orenpapers avatar orenpapers commented on May 1, 2024

@mdasadul Did you manage to do it?

orenpapers avatar orenpapers commented on May 1, 2024

@mdasadul Do you mean this one?
https://twitter.com/mdasaduluofa/status/1181917072999231489/photo/1
I see this is for GPT-2; do you have code for BERT?

orenpapers avatar orenpapers commented on May 1, 2024

@mdasadul I get the error:
TypeError: forward() got an unexpected keyword argument 'masked_lm_labels'
Also, can you please explain why the following steps are necessary:

  1. unsqueeze(0)
  2. add torch.no_grad()
  3. add model.eval()
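
In transformers v4 and later the masked_lm_labels keyword was removed in favor of labels, which is what triggers this TypeError. A minimal sketch of the updated call, with the three steps above annotated (assuming transformers >= 4.x):

import torch
from transformers import DistilBertTokenizer, DistilBertForMaskedLM

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForMaskedLM.from_pretrained('distilbert-base-uncased')
model.eval()  # 3. puts layers like dropout into inference mode, so scoring is deterministic

# 1. unsqueeze(0) adds a batch dimension: shape (seq_len,) -> (1, seq_len)
input_ids = torch.tensor(tokenizer.encode("There is a book on the table")).unsqueeze(0)

with torch.no_grad():  # 2. disables gradient tracking; saves memory for pure inference
    loss = model(input_ids, labels=input_ids)[0]  # 'labels' replaces 'masked_lm_labels'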

nlp-sudo avatar nlp-sudo commented on May 1, 2024

The score is equivalent to perplexity, so the lower the score, the better the sentence, right?
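
For reference, the quantity the snippets above compute is the exponential of the average per-token loss, i.e. a (pseudo-)perplexity, so yes: lower is better.

\mathrm{PPL}(x) = \exp\left( \frac{1}{T} \sum_{t=1}^{T} -\log p(x_t \mid \text{context}) \right)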

orenschonlab avatar orenschonlab commented on May 1, 2024

@mdasadul I get the error:

    return math.exp(loss.item() / len(tokenize_input))
ValueError: only one element tensors can be converted to Python scalars

Any idea why?

orenschonlab avatar orenschonlab commented on May 1, 2024

@mdasadul I have a sentence with more than one word and still get the error.
The sentence is ' Harry had never believed he would' and input_ids is tensor([[ 101, 4302, 2018, 2196, 3373, 2002, 2052, 102]]).

EricFillion avatar EricFillion commented on May 1, 2024

Below is an example from the official docs of how to use GPT-2 to determine perplexity.

https://huggingface.co/transformers/perplexity.html

orenschonlab avatar orenschonlab commented on May 1, 2024

@EricFillion But how can it be used for a sentence, not for a dataset?
Meaning I want the perplexity of the sentence:
Harry had never believed he would

EricFillion avatar EricFillion commented on May 1, 2024

> @EricFillion But how can it be used for a sentence, not for a dataset?
> Meaning I want the perplexity of the sentence:
> Harry had never believed he would

I just played around with the code @mdasadul posted above. It works perfectly and is nice and concise. It produced the same scores as the official documentation for short inputs.

If you're still interested in using the method from the official documentation, you can replace "'\n\n'.join(test['text'])" with the text whose perplexity you want to determine. You'll also want to add ".item()" to ppl to convert the tensor to a float, as sketched below.
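
A minimal sketch of that substitution for a single short sentence (based on the linked perplexity guide; no sliding window is needed when the input fits in the model's context):

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# a single sentence in place of '\n\n'.join(test['text'])
encodings = tokenizer("Harry had never believed he would", return_tensors='pt')

with torch.no_grad():
    loss = model(encodings.input_ids, labels=encodings.input_ids)[0]
ppl = torch.exp(loss).item()  # .item() converts the 0-dim tensor to a float
print(ppl)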

kaisugi avatar kaisugi commented on May 1, 2024

This repo is quite useful. It supports Hugging Face models.

https://github.com/awslabs/mlm-scoring
