Comments (5)
Hi,
Why don't you guys just do tokenizer = BertTokenizer.from_pretrained('bert-base-chinese'), as indicated in the README and the run_classifier.py example?
from transformers.
You need to specify the path of vocab.txt for:
tokenizer = BertTokenizer.from_pretrained(args.bert_model)
@zlinao, I tried to load the vocab using the following code:
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese//vocab.txt")
However, I get errors:
11/19/2018 15:33:13 - INFO - pytorch_pretrained_bert.tokenization - loading vocabulary file bert-base-chinese//vocab.txt
Traceback (most recent call last):
File "E:/PythonWorkSpace/PytorchBert/BertTest/torchTest.py", line 6, in
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese//vocab.txt")
File "C:\anaconda\lib\site-packages\pytorch_pretrained_bert-0.1.2-py3.6.egg\pytorch_pretrained_bert\tokenization.py", line 141, in from_pretrained
File "C:\anaconda\lib\site-packages\pytorch_pretrained_bert-0.1.2-py3.6.egg\pytorch_pretrained_bert\tokenization.py", line 95, in init
File "C:\anaconda\lib\site-packages\pytorch_pretrained_bert-0.1.2-py3.6.egg\pytorch_pretrained_bert\tokenization.py", line 70, in load_vocab
UnicodeDecodeError: 'gbk' codec can't decode byte 0x81 in position 1564: illegal multibyte sequence
Do you have the same problem?
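The traceback above is a locale issue, not a bad vocab file: on Windows with a Chinese locale, open() without an explicit encoding= falls back to the locale default (often 'gbk'), and many UTF-8 byte sequences are not valid GBK. A minimal stdlib sketch of that failure mode, using a single UTF-8 vocab line as an assumed example input:

```python
# A UTF-8 encoded vocabulary line containing a Chinese character.
utf8_line = "你\n".encode("utf-8")  # bytes: E4 BD A0 0A

# GBK decodes pairwise after a lead byte in 0x81-0xFE; here the pair
# E4 BD decodes, but the following A0 0A pair is illegal in GBK, so
# decoding raises UnicodeDecodeError -- the same error as in the log.
try:
    utf8_line.decode("gbk")
    raised = False
except UnicodeDecodeError:
    raised = True

print(raised)  # True: UTF-8 vocab bytes are not valid GBK
```

Decoding the same bytes as UTF-8 succeeds, which is why forcing encoding='utf-8' fixes the load.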
Hi,
Why don't you guys just do tokenizer = BertTokenizer.from_pretrained('bert-base-chinese') as indicated in the README and the run_classifier.py example?

Yes, it is easier to use the shortcut name. Thanks for your great work.
@zlinao, I tried to load the vocab using the following code:
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese//vocab.txt")
However, I get errors:
11/19/2018 15:33:13 - INFO - pytorch_pretrained_bert.tokenization - loading vocabulary file bert-base-chinese//vocab.txt
Traceback (most recent call last):
File "E:/PythonWorkSpace/PytorchBert/BertTest/torchTest.py", line 6, in
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese//vocab.txt")
File "C:\anaconda\lib\site-packages\pytorch_pretrained_bert-0.1.2-py3.6.egg\pytorch_pretrained_bert\tokenization.py", line 141, in from_pretrained
File "C:\anaconda\lib\site-packages\pytorch_pretrained_bert-0.1.2-py3.6.egg\pytorch_pretrained_bert\tokenization.py", line 95, in init
File "C:\anaconda\lib\site-packages\pytorch_pretrained_bert-0.1.2-py3.6.egg\pytorch_pretrained_bert\tokenization.py", line 70, in load_vocab
UnicodeDecodeError: 'gbk' codec can't decode byte 0x81 in position 1564: illegal multibyte sequence
Do you have the same problem?

You can change your encoding to 'utf-8' when you load vocab.txt.
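The suggested fix can be sketched as a vocab loader that passes encoding="utf-8" explicitly. The function name and shape below mirror load_vocab in pytorch_pretrained_bert's tokenization.py, but this is an illustrative stdlib-only sketch, not the library's exact code:

```python
import collections


def load_vocab(vocab_file):
    """Load a BERT vocabulary file into an ordered token -> id mapping."""
    vocab = collections.OrderedDict()
    # Pass the encoding explicitly so Windows does not fall back to the
    # locale default (e.g. 'gbk'), which raises UnicodeDecodeError on the
    # UTF-8 bytes in bert-base-chinese's vocab.txt.
    with open(vocab_file, "r", encoding="utf-8") as reader:
        for index, line in enumerate(reader):
            vocab[line.rstrip("\n")] = index
    return vocab
```

With pytorch_pretrained_bert 0.1.2, the equivalent one-line change (adding encoding="utf-8" to the open() call in tokenization.py) resolves the traceback above; later releases of the library open the vocabulary file as UTF-8 explicitly.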