Comments (30)
Would it be possible to share the sentence that causes the error? I can run it locally to see if I can reproduce it.
from lmppl.
It is not a sentence; it is a list of lines from .txt files, so it is essentially a lot of sentences in a list, and I want to evaluate them all.
And unfortunately I cannot share the dataset anywhere because it has PHI in it
Ok. Probably something to do with the sequence length. Which model are you using to compute perplexity? Also, could you tell me the maximum number of characters in a single line of your file?
I am fairly sure that it is 512. I think I need help with how to batch my dataset.
If you run it on a single instance, instead of passing a list, you should be able to find the line that causes the error.
text = dsMap['test']['text']
for t in text:
    ppl = scorer.get_perplexity([t])
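A hedged extension of that idea: wrap each call in try/except so the scan records every offending line instead of stopping at the first one. The function name is mine, and the scorer object and error type are assumed from the rest of the thread:

```python
def find_failing_lines(scorer, texts):
    # Score each line on its own; collect (index, text, error) for any line
    # that raises the tensor-size RuntimeError seen in this thread.
    failures = []
    for i, t in enumerate(texts):
        try:
            scorer.get_perplexity([t])
        except RuntimeError as err:
            failures.append((i, t, str(err)))
    return failures
```

Running it over `dsMap['test']['text']` would list every problematic line in one pass.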
I am running a version of BERT, but the line that caused issues has only 129 characters in it
Does the sentence contain only roman characters?
It has dashes, commas, a period, and a colon (I don't know if those count), but other than that, yes.
I have pretrained BERT on this same dataset before with the same tokenizer and model.
Could you try to compute perplexity on the same files but with roberta-base? It might be a BERT-specific issue.
Same error, same spot
But it did take a little longer to get there, although that may be because it was loading a new model in.
So RoBERTa and BERT raise the same error on the same line, correct?
Correct.
If you could mask the letters in the sentence with a random character, would it be possible to share it here? For instance, if the sentence is @lmppl I have, to say this √√ is &&private 1234---, then you could convert it into @aaaa a aaaa, aa aaa aaaa √√ aa &&aaaaaaa aaaa---. Here, I replaced all the letters with a. This way, there's no way I can restore the original sentence, but there's a high chance that you would see the same issue there.
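A minimal sketch of that masking step (the function name is mine; it preserves case, digits, punctuation, and non-ASCII symbols, matching the masked sentence shared later in the thread):

```python
import re

def mask_letters(text):
    # Replace every lowercase letter with 'a' and every uppercase letter
    # with 'A'; everything else passes through unchanged, so the token
    # boundaries and lengths that matter for tokenization are kept.
    return re.sub(r'[a-z]', 'a', re.sub(r'[A-Z]', 'A', text))
```

Applying it to each suspect line makes the text safe to paste into an issue while keeping the structure that likely triggers the tokenizer bug.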
I guess the letters themselves are not the root cause of the error, and that's why, for debugging the issue, they are not important.
Here:
Aaaaaaaaa: Aaaaaaaaa aaaaaaa aaaa aaaaaaaa - AAA10-AA A19.94, Aaaaaaaaa aaaaa - AAA10-AA A19.10, Aaaaaaa aaaaaaaaaa - AAA10-
I can confirm that, in my code, running just that sentence reproduces the same error. So if that doesn't work for you, it may be that I have accidentally edited the code beyond replacing the transformer.
It's working fine with the latest lmppl (0.3.1), as below.
In [1]: from lmppl import LM
In [2]: lm = LM('roberta-base')
In [3]: text = "Aaaaaaaaa: Aaaaaaaaa aaaaaaa aaaa aaaaaaaa - AAA10-AA A19.94, Aaaaaaaaa aaaaa - AAA10-AA A19.10, Aaaaaaa aaaaaaaaaa - AAA10-"
In [4]: lm.get_perplexity(text)
Out[4]: 115701099.01106478
So I recopied run_mlm.py and then ran it as is with roberta-base, and here is what I got:
RuntimeError Traceback (most recent call last)
Cell In[115], line 5
3 scorer = MaskedLM('roberta-base')
4 text = dsMap['test']['text']
----> 5 ppl = scorer.get_perplexity("Aaaaaaaaa: Aaaaaaaaa aaaaaaa aaaa aaaaaaaa - AAA10-AA A19.94, Aaaaaaaaa aaaaa - AAA10-AA A19.10, Aaaaaaa aaaaaaaaaa - AAA10-")
6 # for t in text:
7 # ppl = scorer.get_perplexity([t])
8 # print(t)
9 #ppl = scorer.get_perplexity(text, batch=32)
10 print(ppl)
Cell In[113], line 155, in MaskedLM.get_perplexity(self, input_texts, batch)
153 for s, e in tqdm(batch_id):
154 _encode = data[s:e]
--> 155 _encode = {k: torch.cat([o[k] for o in _encode], dim=0).to(self.device) for k in _encode[0].keys()}
156 labels = _encode.pop('labels')
157 output = self.model(**_encode, return_dict=True)
Cell In[113], line 155, in <dictcomp>(.0)
153 for s, e in tqdm(batch_id):
154 _encode = data[s:e]
--> 155 _encode = {k: torch.cat([o[k] for o in _encode], dim=0).to(self.device) for k in _encode[0].keys()}
156 labels = _encode.pop('labels')
157 output = self.model(**_encode, return_dict=True)
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 54 but got size 53 for tensor number 5 in the list.
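The `torch.cat(dim=0)` call in that traceback requires every tensor to match in all dimensions except the first, so it fails when the tokenized chunks in a batch come out with different sequence lengths (54 vs. 53 tokens here). The usual fix is to right-pad each id sequence to the batch maximum before stacking. A minimal sketch of that idea (the function name is mine, and the default pad id of 0 is an assumption; real code would use the tokenizer's `pad_token_id`):

```python
def pad_batch(id_seqs, pad_id=0):
    # Right-pad every token-id sequence to the length of the longest one,
    # so the batch can be stacked into a single rectangular tensor.
    max_len = max(len(s) for s in id_seqs)
    return [s + [pad_id] * (max_len - len(s)) for s in id_seqs]
```

With equal-length rows, the concatenation along dimension 0 no longer fails.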
I just ran your exact code and it worked. Can I send you my version of the code (run_mlm.py, I mean)?
Ok, so with a little bit of modification, I can run my whole dataset on the code that you sent. Now I just need to figure out how to modify it to use a different transformer without breaking it, I think.
Although I do wonder why you chose to use the version for GPT variants rather than the one for BERT variants in your example
If I import and use MaskedLM rather than LM, it breaks; I'm not sure why, though.
Could you try the following instead?
scorer.get_perplexity(["Aaaaaaaaa: Aaaaaaaaa aaaaaaa aaaa aaaaaaaa - AAA10-AA A19.94, Aaaaaaaaa aaaaa - AAA10-AA A19.10, Aaaaaaa aaaaaaaaaa - AAA10-"])
Ah wait, you're right. I should have used MaskedLM, not LM.
Yeah, it's working without any issue.
In [1]: from lmppl import MaskedLM
In [2]: text = "Aaaaaaaaa: Aaaaaaaaa aaaaaaa aaaa aaaaaaaa - AAA10-AA A19.94, Aaaaaaaaa aaaaa - AAA10-AA A19.10, Aaaaaaa aaaaaaaaaa - AAA10-"
In [3]: lm = MaskedLM('roberta-base')
In [4]: lm.get_perplexity(text)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 2.13s/it]
Out[4]: 2.676919185346931
Ok, so that simple example works, but then when I tried to loop through it like so:
from lmppl import MaskedLM

lm = MaskedLM('/path/to/local/model/')
text = dsMap['test']['text']
num = 0
sum1 = 0
for t in text:
    sum1 = lm.get_perplexity(t) + sum1
    num = num + 1
print(sum1/num)
it returns another tensor error on the very same text that we were just testing. It gets past the first two iterations but then breaks on the third, which is the line we have been working on:
RuntimeError Traceback (most recent call last)
Cell In[14], line 8
6 sum1 = 0
7 for t in text:
----> 8 sum1 = lm.get_perplexity(t) + sum1
9 num = num + 1
10 print(sum1/num)
File ~/.conda/envs/BioBERTUAB/lib/python3.10/site-packages/lmppl/ppl_mlm.py:154, in MaskedLM.get_perplexity(self, input_texts, batch)
152 for s, e in tqdm(batch_id):
153 _encode = data[s:e]
--> 154 _encode = {k: torch.cat([o[k] for o in _encode], dim=0).to(self.device) for k in _encode[0].keys()}
155 labels = _encode.pop('labels')
156 output = self.model(**_encode, return_dict=True)
File ~/.conda/envs/BioBERTUAB/lib/python3.10/site-packages/lmppl/ppl_mlm.py:154, in <dictcomp>(.0)
152 for s, e in tqdm(batch_id):
153 _encode = data[s:e]
--> 154 _encode = {k: torch.cat([o[k] for o in _encode], dim=0).to(self.device) for k in _encode[0].keys()}
155 labels = _encode.pop('labels')
156 output = self.model(**_encode, return_dict=True)
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 13 but got size 12 for tensor number 1 in the list.
It does not return any errors on the exact same code with LM, but it also returns a perplexity number that is wildly incorrect, because that isn't the right type of evaluation for the model.
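Until the underlying batching bug is fixed, one hedged workaround for the averaging loop above is to catch the RuntimeError, skip the offending lines, and report how many were skipped. This is my own sketch, assuming the scorer behaves like the `MaskedLM` instance in the thread:

```python
def mean_perplexity(scorer, texts):
    # Average per-line perplexity, skipping lines that raise the
    # tensor-size RuntimeError instead of aborting the whole run.
    total, count, skipped = 0.0, 0, 0
    for t in texts:
        try:
            total += scorer.get_perplexity(t)
            count += 1
        except RuntimeError:
            skipped += 1
    return (total / count if count else float('nan')), skipped
```

Reporting the skipped count keeps the average honest: a large skip rate would mean the workaround is hiding too much of the dataset.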