Comments (5)
Hi,
Why don't you guys just do tokenizer = BertTokenizer.from_pretrained('bert-base-chinese'), as indicated in the README and the run_classifier.py example?
from transformers.
You need to specify the path of vocab.txt for:
tokenizer = BertTokenizer.from_pretrained(args.bert_model)
@zlinao, I tried to load the vocab using the following code:
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese//vocab.txt")
However, I get errors:
11/19/2018 15:33:13 - INFO - pytorch_pretrained_bert.tokenization - loading vocabulary file bert-base-chinese//vocab.txt
Traceback (most recent call last):
File "E:/PythonWorkSpace/PytorchBert/BertTest/torchTest.py", line 6, in
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese//vocab.txt")
File "C:\anaconda\lib\site-packages\pytorch_pretrained_bert-0.1.2-py3.6.egg\pytorch_pretrained_bert\tokenization.py", line 141, in from_pretrained
File "C:\anaconda\lib\site-packages\pytorch_pretrained_bert-0.1.2-py3.6.egg\pytorch_pretrained_bert\tokenization.py", line 95, in init
File "C:\anaconda\lib\site-packages\pytorch_pretrained_bert-0.1.2-py3.6.egg\pytorch_pretrained_bert\tokenization.py", line 70, in load_vocab
UnicodeDecodeError: 'gbk' codec can't decode byte 0x81 in position 1564: illegal multibyte sequence
Do you have the same problem?
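The traceback above is a locale issue, not a bad vocab file: on Windows with a Chinese locale, open() without an explicit encoding= falls back to the locale default (often 'gbk'), and many UTF-8 byte sequences are not valid GBK. A minimal stdlib sketch of that failure mode, using a single UTF-8 vocab line as an assumed example input:

```python
# A UTF-8 encoded vocabulary line containing a Chinese character.
utf8_line = "你\n".encode("utf-8")  # bytes: E4 BD A0 0A

# GBK decodes pairwise after a lead byte in 0x81-0xFE; here the pair
# E4 BD decodes, but the following A0 0A pair is illegal in GBK, so
# decoding raises UnicodeDecodeError -- the same error as in the log.
try:
    utf8_line.decode("gbk")
    raised = False
except UnicodeDecodeError:
    raised = True

print(raised)  # True: UTF-8 vocab bytes are not valid GBK
```

Decoding the same bytes as UTF-8 succeeds, which is why forcing encoding='utf-8' fixes the load.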
Hi,
Why don't you guys just do tokenizer = BertTokenizer.from_pretrained('bert-base-chinese') as indicated in the README and the run_classifier.py example?

Yes, it is easier to use the shortcut name. Thanks for your great work.
@zlinao, I tried to load the vocab using the following code:
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese//vocab.txt")
However, I get errors:
11/19/2018 15:33:13 - INFO - pytorch_pretrained_bert.tokenization - loading vocabulary file bert-base-chinese//vocab.txt
Traceback (most recent call last):
File "E:/PythonWorkSpace/PytorchBert/BertTest/torchTest.py", line 6, in
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese//vocab.txt")
File "C:\anaconda\lib\site-packages\pytorch_pretrained_bert-0.1.2-py3.6.egg\pytorch_pretrained_bert\tokenization.py", line 141, in from_pretrained
File "C:\anaconda\lib\site-packages\pytorch_pretrained_bert-0.1.2-py3.6.egg\pytorch_pretrained_bert\tokenization.py", line 95, in init
File "C:\anaconda\lib\site-packages\pytorch_pretrained_bert-0.1.2-py3.6.egg\pytorch_pretrained_bert\tokenization.py", line 70, in load_vocab
UnicodeDecodeError: 'gbk' codec can't decode byte 0x81 in position 1564: illegal multibyte sequence
Do you have the same problem?

You can change your encoding to 'utf-8' when you load vocab.txt.
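The suggested fix can be sketched as a vocab loader that passes encoding="utf-8" explicitly. The function name and shape below mirror load_vocab in pytorch_pretrained_bert's tokenization.py, but this is an illustrative stdlib-only sketch, not the library's exact code:

```python
import collections


def load_vocab(vocab_file):
    """Load a BERT vocabulary file into an ordered token -> id mapping."""
    vocab = collections.OrderedDict()
    # Pass the encoding explicitly so Windows does not fall back to the
    # locale default (e.g. 'gbk'), which raises UnicodeDecodeError on the
    # UTF-8 bytes in bert-base-chinese's vocab.txt.
    with open(vocab_file, "r", encoding="utf-8") as reader:
        for index, line in enumerate(reader):
            vocab[line.rstrip("\n")] = index
    return vocab
```

With pytorch_pretrained_bert 0.1.2, the equivalent one-line change (adding encoding="utf-8" to the open() call in tokenization.py) resolves the traceback above; later releases of the library open the vocabulary file as UTF-8 explicitly.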