
Comments (5)

thomwolf commented on May 1, 2024

Hi,
Why don't you guys just do tokenizer = BertTokenizer.from_pretrained('bert-base-chinese') as indicated in the readme and the run_classifier.py example?

from transformers.

zlinao commented on May 1, 2024

You need to specify the path of vocab.txt for:
tokenizer = BertTokenizer.from_pretrained(args.bert_model)


coddinglxf commented on May 1, 2024

@zlinao, I tried to load the vocab using the following code:
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese//vocab.txt")

However, I get this error:
11/19/2018 15:33:13 - INFO - pytorch_pretrained_bert.tokenization - loading vocabulary file bert-base-chinese//vocab.txt
Traceback (most recent call last):
File "E:/PythonWorkSpace/PytorchBert/BertTest/torchTest.py", line 6, in
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese//vocab.txt")
File "C:\anaconda\lib\site-packages\pytorch_pretrained_bert-0.1.2-py3.6.egg\pytorch_pretrained_bert\tokenization.py", line 141, in from_pretrained
File "C:\anaconda\lib\site-packages\pytorch_pretrained_bert-0.1.2-py3.6.egg\pytorch_pretrained_bert\tokenization.py", line 95, in __init__
File "C:\anaconda\lib\site-packages\pytorch_pretrained_bert-0.1.2-py3.6.egg\pytorch_pretrained_bert\tokenization.py", line 70, in load_vocab
UnicodeDecodeError: 'gbk' codec can't decode byte 0x81 in position 1564: illegal multibyte sequence

Do you have the same problem?


zlinao commented on May 1, 2024

> Hi,
> Why don't you guys just do tokenizer = BertTokenizer.from_pretrained('bert-base-chinese') as indicated in the readme and the run_classifier.py example?

Yes, it is easier to use the shortcut name. Thanks for your great work.


zlinao commented on May 1, 2024

> @zlinao, I tried to load the vocab using the following code:
> tokenizer = BertTokenizer.from_pretrained("bert-base-chinese//vocab.txt")
>
> However, I get this error:
> 11/19/2018 15:33:13 - INFO - pytorch_pretrained_bert.tokenization - loading vocabulary file bert-base-chinese//vocab.txt
> Traceback (most recent call last):
> File "E:/PythonWorkSpace/PytorchBert/BertTest/torchTest.py", line 6, in
> tokenizer = BertTokenizer.from_pretrained("bert-base-chinese//vocab.txt")
> File "C:\anaconda\lib\site-packages\pytorch_pretrained_bert-0.1.2-py3.6.egg\pytorch_pretrained_bert\tokenization.py", line 141, in from_pretrained
> File "C:\anaconda\lib\site-packages\pytorch_pretrained_bert-0.1.2-py3.6.egg\pytorch_pretrained_bert\tokenization.py", line 95, in __init__
> File "C:\anaconda\lib\site-packages\pytorch_pretrained_bert-0.1.2-py3.6.egg\pytorch_pretrained_bert\tokenization.py", line 70, in load_vocab
> UnicodeDecodeError: 'gbk' codec can't decode byte 0x81 in position 1564: illegal multibyte sequence
>
> Do you have the same problem?

You can change the encoding to 'utf-8' when you load vocab.txt.
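A minimal sketch of what that could look like, assuming the vocabulary is a UTF-8 text file with one token per line. The function name load_vocab_utf8 and the sample tokens are hypothetical and only for illustration; this is not the library's own code. The point is just that the encoding is passed explicitly, so the platform default (gbk in the traceback above) is not used:

```python
import collections
import os
import tempfile


def load_vocab_utf8(vocab_file):
    """Map each token (one per line) to its line index, reading the file as UTF-8."""
    vocab = collections.OrderedDict()
    # Passing encoding explicitly avoids falling back to the platform default,
    # which was gbk in the traceback above.
    with open(vocab_file, "r", encoding="utf-8") as reader:
        for index, line in enumerate(reader):
            vocab[line.rstrip("\n")] = index
    return vocab


# Hypothetical sample file standing in for the real vocab.txt.
sample_tokens = ["[PAD]", "[UNK]", "中", "文"]
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as handle:
    handle.write("\n".join(sample_tokens) + "\n")
    sample_path = handle.name

vocab = load_vocab_utf8(sample_path)
os.remove(sample_path)
print(vocab["中"])
```

If you are patching an installed copy of the package, the same idea would apply to whichever open() call reads the vocabulary file there.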

