
Comments (5)

ArthurZucker commented on August 21, 2024

cc @itazap

ArthurZucker commented on August 21, 2024

Hey! I think this was recently fixed, so installing 4.42.x should work. I just tested locally (screenshot omitted). Make sure to upgrade:

pip install -U transformers
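To confirm the upgrade took effect, a quick sanity check (nothing here is specific to this issue) is:

import transformers

print(transformers.__version__)  # should print 4.42.0 or newer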

Lin-xs commented on August 21, 2024

> Hey! I think this was recently fixed, so installing 4.42.x should work. I just tested locally (screenshot omitted). Make sure to upgrade: pip install -U transformers

Thank you @ArthurZucker, the tokenizer works well now. However, when I try to save and then reload it, another error occurs: RuntimeError: Internal: could not parse ModelProto from ...

Code:

from transformers import AutoTokenizer

# Load the tokenizer directly from the GGUF file -- this now works
model_id = "QuantFactory/Meta-Llama-3-8B-GGUF"
filename = "Meta-Llama-3-8B.Q4_K_M.gguf"
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)

# Save it, then reload from disk -- this is where the error occurs
save_dir = '../../deq_models/test'
tokenizer.save_pretrained(save_dir)

tokenizer2 = AutoTokenizer.from_pretrained(save_dir)  # raises RuntimeError

Package versions:
sentencepiece 0.2.0
transformers 4.42.3
Traceback:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[6], line 1
----> 1 tokenizer2 = AutoTokenizer.from_pretrained(save_dir)

File ~/miniconda3/envs/swq/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py:889, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    885     if tokenizer_class is None:
    886         raise ValueError(
    887             f"Tokenizer class {tokenizer_class_candidate} does not exist or is not currently imported."
    888         )
--> 889     return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    891 # Otherwise we have to be creative.
    892 # if model is an encoder decoder, the encoder tokenizer class is used by default
    893 if isinstance(config, EncoderDecoderConfig):

File ~/miniconda3/envs/swq/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:2163, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, trust_remote_code, *init_inputs, **kwargs)
   2160     else:
   2161         logger.info(f"loading file {file_path} from cache at {resolved_vocab_files[file_id]}")
-> 2163 return cls._from_pretrained(
   2164     resolved_vocab_files,
   2165     pretrained_model_name_or_path,
   2166     init_configuration,
   2167     *init_inputs,
   2168     token=token,
   2169     cache_dir=cache_dir,
   2170     local_files_only=local_files_only,
   2171     _commit_hash=commit_hash,
   2172     _is_local=is_local,
   2173     trust_remote_code=trust_remote_code,
   2174     **kwargs,
   2175 )

File ~/miniconda3/envs/swq/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:2397, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, token, cache_dir, local_files_only, _commit_hash, _is_local, trust_remote_code, *init_inputs, **kwargs)
   2395 # Instantiate the tokenizer.
   2396 try:
-> 2397     tokenizer = cls(*init_inputs, **init_kwargs)
   2398 except OSError:
   2399     raise OSError(
   2400         "Unable to load vocabulary from file. "
   2401         "Please check that the provided vocabulary is accessible and not corrupted."
   2402     )

File ~/miniconda3/envs/swq/lib/python3.11/site-packages/transformers/models/llama/tokenization_llama_fast.py:157, in LlamaTokenizerFast.__init__(self, vocab_file, tokenizer_file, clean_up_tokenization_spaces, unk_token, bos_token, eos_token, add_bos_token, add_eos_token, use_default_system_prompt, legacy, add_prefix_space, **kwargs)
    154 if add_prefix_space is not None:
    155     kwargs["from_slow"] = True
--> 157 super().__init__(
    158     vocab_file=vocab_file,
    159     tokenizer_file=tokenizer_file,
    160     clean_up_tokenization_spaces=clean_up_tokenization_spaces,
    161     unk_token=unk_token,
    162     bos_token=bos_token,
    163     eos_token=eos_token,
    164     add_bos_token=add_bos_token,
    165     add_eos_token=add_eos_token,
    166     use_default_system_prompt=use_default_system_prompt,
    167     add_prefix_space=add_prefix_space,
    168     legacy=legacy,
    169     **kwargs,
    170 )
    171 self._add_bos_token = add_bos_token
    172 self._add_eos_token = add_eos_token

File ~/miniconda3/envs/swq/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py:131, in PreTrainedTokenizerFast.__init__(self, *args, **kwargs)
    127         kwargs.update(additional_kwargs)
    129 elif self.slow_tokenizer_class is not None:
    130     # We need to create and convert a slow tokenizer to build the backend
--> 131     slow_tokenizer = self.slow_tokenizer_class(*args, **kwargs)
    132     fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
    133 else:

File ~/miniconda3/envs/swq/lib/python3.11/site-packages/transformers/models/llama/tokenization_llama.py:171, in LlamaTokenizer.__init__(self, vocab_file, unk_token, bos_token, eos_token, pad_token, sp_model_kwargs, add_bos_token, add_eos_token, clean_up_tokenization_spaces, use_default_system_prompt, spaces_between_special_tokens, legacy, add_prefix_space, **kwargs)
    169 self.add_eos_token = add_eos_token
    170 self.use_default_system_prompt = use_default_system_prompt
--> 171 self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
    172 self.add_prefix_space = add_prefix_space
    174 super().__init__(
    175     bos_token=bos_token,
    176     eos_token=eos_token,
   (...)
    187     **kwargs,
    188 )

File ~/miniconda3/envs/swq/lib/python3.11/site-packages/transformers/models/llama/tokenization_llama.py:198, in LlamaTokenizer.get_spm_processor(self, from_slow)
    196 tokenizer = spm.SentencePieceProcessor(**self.sp_model_kwargs)
    197 if self.legacy or from_slow:  # no dependency on protobuf
--> 198     tokenizer.Load(self.vocab_file)
    199     return tokenizer
    201 with open(self.vocab_file, "rb") as f:

File ~/miniconda3/envs/swq/lib/python3.11/site-packages/sentencepiece/__init__.py:961, in SentencePieceProcessor.Load(self, model_file, model_proto)
    959 if model_proto:
    960   return self.LoadFromSerializedProto(model_proto)
--> 961 return self.LoadFromFile(model_file)

File ~/miniconda3/envs/swq/lib/python3.11/site-packages/sentencepiece/__init__.py:316, in SentencePieceProcessor.LoadFromFile(self, arg)
    315 def LoadFromFile(self, arg):
--> 316     return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)

RuntimeError: Internal: could not parse ModelProto from ../../deq_models/test/tokenizer.model
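The traceback bottoms out in sentencepiece's own loader, so the saved tokenizer.model is evidently not a valid SentencePiece proto. A minimal way to reproduce just that last step, independent of transformers (same path as above, purely illustrative):

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
# Same call the traceback ends in; it fails because the file transformers
# saved is not a serialized SentencePiece ModelProto:
sp.Load("../../deq_models/test/tokenizer.model")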

Lin-xs commented on August 21, 2024

Sorry for the accidental action.

itazap commented on August 21, 2024

I found that this is caused by setting add_prefix_space=False in GGUFLlamaConverter. In turn, from_slow=True is then forced (from #28010). I checked loading from "meta-llama/Meta-Llama-3-8B", and I don't believe the add_prefix_space=False [#30391] is necessary: I checked the tokenization, and a prefix space is not added when it is set to None. I can push a fix to change it to add_prefix_space=None (and test!) unless @ArthurZucker sees an issue with this? The relevant code path is sketched below.
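For context, this is the chain that turns add_prefix_space=False into the ModelProto error, reconstructed from the traceback above (paraphrased excerpts, not the verbatim source):

# transformers/models/llama/tokenization_llama_fast.py
# Any non-None add_prefix_space forces a rebuild from the slow tokenizer:
if add_prefix_space is not None:
    kwargs["from_slow"] = True

# transformers/models/llama/tokenization_llama.py
# The slow path hands the saved tokenizer.model straight to SentencePiece,
# which fails because the file is not a SentencePiece ModelProto
# (Llama 3 uses a BPE tokenizer):
if self.legacy or from_slow:
    tokenizer.Load(self.vocab_file)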

@Lin-xs as a workaround for now, you can pass add_prefix_space=False as shown below to avoid the error!

AutoTokenizer.from_pretrained(model_id, gguf_file=filename, add_prefix_space=False)
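A round-trip sketch with that workaround applied (paths and filenames are taken from the repro above; passing the flag on the reload as well is an assumption, in case the saved config triggers the slow path again):

from transformers import AutoTokenizer

model_id = "QuantFactory/Meta-Llama-3-8B-GGUF"
filename = "Meta-Llama-3-8B.Q4_K_M.gguf"
save_dir = "../../deq_models/test"

# Load from the GGUF file with the workaround flag
tokenizer = AutoTokenizer.from_pretrained(
    model_id, gguf_file=filename, add_prefix_space=False
)
tokenizer.save_pretrained(save_dir)

# Assumption: pass the flag when reloading from disk as well
tokenizer2 = AutoTokenizer.from_pretrained(save_dir, add_prefix_space=False)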
