
Comments

thomwolf avatar thomwolf commented on May 1, 2024 7

I don't think you can do that with BERT. The masked LM loss is not a language modeling loss; it doesn't factorize with the chain rule the way the usual language modeling loss does.
Please see the discussion on the TensorFlow repo on that here.
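
In symbols (a sketch of the distinction): a causal LM factorizes the sentence probability exactly via the chain rule, while summing BERT's per-word masked losses only yields a pseudo-log-likelihood, because every term conditions on both left and right context:

\log p(x) = \sum_{t=1}^{T} \log p(x_t \mid x_{<t}) \quad \text{(causal LM, exact)}

\mathrm{PLL}(x) = \sum_{t=1}^{T} \log p(x_t \mid x_{\setminus t}) \quad \text{(masked LM, not a true log-probability)}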

zhangyichang avatar zhangyichang commented on May 1, 2024 4

Hello @thomwolf, I can see that it is possible to assign a score using BERT, by masking each word sequentially and then scoring the sentence as the sum of the per-word scores. Here is how people were doing it for TensorFlow. I am trying the following:

import numpy as np
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM

# Load pre-trained model (weights)
with torch.no_grad():
    model = BertForMaskedLM.from_pretrained('bert-large-cased')
    model.eval()
    # Load pre-trained model tokenizer (vocabulary)
    tokenizer = BertTokenizer.from_pretrained('bert-large-cased')

def score(sentence):
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    sentence_loss = 0.
    for i, word in enumerate(tokenize_input):
        tokenize_input[i] = '[MASK]'
        mask_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
        word_loss = model(mask_input, masked_lm_labels=tensor_input).data.numpy()
        sentence_loss += word_loss
        # print("Word: %s : %f" % (word, np.exp(-word_loss)))
    return np.exp(sentence_loss / len(tokenize_input))

score("There is a book on the table")
88.899999

Is this the right way to assign a score using BERT?

No, you mask each word but never restore it, so after the first iteration the input contains more than one [MASK] token. See the corrected sketch below.
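
For reference, a minimal corrected sketch of the loop above that restores each token after scoring it (same pytorch_pretrained_bert API; the masked_lm_labels handling is kept exactly as in the original snippet):

import numpy as np
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM

model = BertForMaskedLM.from_pretrained('bert-large-cased')
model.eval()
tokenizer = BertTokenizer.from_pretrained('bert-large-cased')

def score(sentence):
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    sentence_loss = 0.
    for i in range(len(tokenize_input)):
        original_token = tokenize_input[i]
        tokenize_input[i] = '[MASK]'  # mask only the current position
        mask_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
        with torch.no_grad():
            word_loss = model(mask_input, masked_lm_labels=tensor_input).item()
        sentence_loss += word_loss
        tokenize_input[i] = original_token  # restore before masking the next position
    return np.exp(sentence_loss / len(tokenize_input))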

mdasadul avatar mdasadul commented on May 1, 2024 2

It should be similar. The following code is for DistilBERT:

import math

import torch
from transformers import DistilBertTokenizer, DistilBertForMaskedLM

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def load_model():
    model = DistilBertForMaskedLM.from_pretrained('distilbert-base-uncased').to(device)
    model.eval()
    tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
    return tokenizer, model

tokenizer, model = load_model()

def score(sentence):
    # sentinel value for degenerate inputs: single words and over-long sequences
    if len(sentence.strip().split()) <= 1:
        return 10000
    tokenize_input = tokenizer.tokenize(sentence)
    if len(tokenize_input) > 512:
        return 10000
    input_ids = torch.tensor(tokenizer.encode(tokenize_input)).unsqueeze(0).to(device)
    with torch.no_grad():
        loss = model(input_ids, masked_lm_labels=input_ids)[0]
    return math.exp(loss.item() / len(tokenize_input))
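
A quick usage sketch of the function above; lower scores mean lower pseudo-perplexity, i.e. sentences the model finds more fluent:

print(score("There is a book on the table"))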

mdasadul avatar mdasadul commented on May 1, 2024 1

@orenschonlab Try the code below:

import sys

import numpy as np
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load pre-trained model (weights)
with torch.no_grad():
    model = GPT2LMHeadModel.from_pretrained('gpt2')
    model.eval()
# Load pre-trained model tokenizer (vocabulary)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

def score(sentence):
    tokenize_input = tokenizer.encode(sentence)
    tensor_input = torch.tensor([tokenize_input])
    loss = model(tensor_input, labels=tensor_input)[0]
    return np.exp(loss.detach().numpy())

if __name__ == '__main__':
    for line in sys.stdin:
        if line.strip() != '':
            print(line.strip() + '\t' + str(score(line.strip())))
        else:
            break

orenpapers avatar orenpapers commented on May 1, 2024

@mdasadul Did you manage to do it?

orenpapers avatar orenpapers commented on May 1, 2024

@mdasadul Do you mean this one?
https://twitter.com/mdasaduluofa/status/1181917072999231489/photo/1
I see this is for GPT-2; do you have code for BERT?

orenpapers avatar orenpapers commented on May 1, 2024

@mdasadul I get the error:
TypeError: forward() got an unexpected keyword argument 'masked_lm_labels'
Also, can you please explain why the following steps are necessary:

  1. unsqueeze(0)
  2. add torch.no_grad()
  3. add model.eval()
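
In transformers v4 and later the masked_lm_labels keyword was removed in favor of labels, which is what triggers this TypeError. A minimal sketch of the updated call, with the three steps above annotated (assuming transformers >= 4.x):

import torch
from transformers import DistilBertTokenizer, DistilBertForMaskedLM

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForMaskedLM.from_pretrained('distilbert-base-uncased')
model.eval()  # 3. puts layers like dropout into inference mode, so scoring is deterministic

# 1. unsqueeze(0) adds a batch dimension: shape (seq_len,) -> (1, seq_len)
input_ids = torch.tensor(tokenizer.encode("There is a book on the table")).unsqueeze(0)

with torch.no_grad():  # 2. disables gradient tracking; saves memory for pure inference
    loss = model(input_ids, labels=input_ids)[0]  # 'labels' replaces 'masked_lm_labels'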

nlp-sudo avatar nlp-sudo commented on May 1, 2024

The score is equivalent to perplexity, so the lower the score, the better the sentence, right?
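
For reference, the quantity the snippets above compute is the exponential of the average per-token loss, i.e. a (pseudo-)perplexity, so yes: lower is better.

\mathrm{PPL}(x) = \exp\left( \frac{1}{T} \sum_{t=1}^{T} -\log p(x_t \mid \text{context}) \right)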

orenschonlab avatar orenschonlab commented on May 1, 2024

@mdasadul I get the error:

    return math.exp(loss.item() / len(tokenize_input))
ValueError: only one element tensors can be converted to Python scalars

Any idea why?

orenschonlab avatar orenschonlab commented on May 1, 2024

@mdasadul I have a sentence with more than one word and still get the error.
The sentence is ' Harry had never believed he would' and input_ids is tensor([[ 101, 4302, 2018, 2196, 3373, 2002, 2052, 102]]).

EricFillion avatar EricFillion commented on May 1, 2024

Below is an example from the official docs of how to use GPT-2 to determine perplexity.

https://huggingface.co/transformers/perplexity.html

orenschonlab avatar orenschonlab commented on May 1, 2024

@EricFillion But how can it be used for a sentence, not for a dataset?
Meaning I want the perplexity of the sentence:
Harry had never believed he would

EricFillion avatar EricFillion commented on May 1, 2024

> @EricFillion But how can it be used for a sentence, not for a dataset?
> Meaning I want the perplexity of the sentence:
> Harry had never believed he would

I just played around with the code @mdasadul posted above. It works perfectly and is nice and concise. It produced the same scores as the official documentation for short inputs.

If you're still interested in using the method from the official documentation, you can replace "'\n\n'.join(test['text'])" with the text whose perplexity you want to determine. You'll also want to add ".item()" to ppl to convert the tensor to a float, as sketched below.
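
A minimal sketch of that substitution for a single short sentence (based on the linked perplexity guide; no sliding window is needed when the input fits in the model's context):

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# a single sentence in place of '\n\n'.join(test['text'])
encodings = tokenizer("Harry had never believed he would", return_tensors='pt')

with torch.no_grad():
    loss = model(encodings.input_ids, labels=encodings.input_ids)[0]
ppl = torch.exp(loss).item()  # .item() converts the 0-dim tensor to a float
print(ppl)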

kaisugi avatar kaisugi commented on May 1, 2024

This repo is quite useful. It supports Hugging Face models.

https://github.com/awslabs/mlm-scoring
