
Comments (5)

ArthurZucker commented on August 21, 2024

cc @itazap

ArthurZucker commented on August 21, 2024

Hey! I think this was recently fixed, so installing 4.42.x should work. I just tested locally (screenshot omitted). Make sure to upgrade:

pip install -U transformers
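To confirm the upgrade took effect, a quick sanity check (nothing here is specific to this issue) is:

import transformers

print(transformers.__version__)  # should print 4.42.0 or newer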

Lin-xs commented on August 21, 2024

> Hey! I think this was recently fixed, so installing 4.42.x should work. I just tested locally (screenshot omitted). Make sure to upgrade: pip install -U transformers

Thank you @ArthurZucker, the tokenizer works well now. However, when I try to save and then reload it, another error occurs: RuntimeError: Internal: could not parse ModelProto from ...

Code:

from transformers import AutoTokenizer

# Load the tokenizer directly from the GGUF file -- this now works
model_id = "QuantFactory/Meta-Llama-3-8B-GGUF"
filename = "Meta-Llama-3-8B.Q4_K_M.gguf"
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)

# Save it, then reload from disk -- this is where the error occurs
save_dir = '../../deq_models/test'
tokenizer.save_pretrained(save_dir)

tokenizer2 = AutoTokenizer.from_pretrained(save_dir)  # raises RuntimeError

Package versions:
sentencepiece 0.2.0
transformers 4.42.3
Traceback:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[6], line 1
----> 1 tokenizer2 = AutoTokenizer.from_pretrained(save_dir)

File ~/miniconda3/envs/swq/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py:889, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    885     if tokenizer_class is None:
    886         raise ValueError(
    887             f"Tokenizer class {tokenizer_class_candidate} does not exist or is not currently imported."
    888         )
--> 889     return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    891 # Otherwise we have to be creative.
    892 # if model is an encoder decoder, the encoder tokenizer class is used by default
    893 if isinstance(config, EncoderDecoderConfig):

File ~/miniconda3/envs/swq/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:2163, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, trust_remote_code, *init_inputs, **kwargs)
   2160     else:
   2161         logger.info(f"loading file {file_path} from cache at {resolved_vocab_files[file_id]}")
-> 2163 return cls._from_pretrained(
   2164     resolved_vocab_files,
   2165     pretrained_model_name_or_path,
   2166     init_configuration,
   2167     *init_inputs,
   2168     token=token,
   2169     cache_dir=cache_dir,
   2170     local_files_only=local_files_only,
   2171     _commit_hash=commit_hash,
   2172     _is_local=is_local,
   2173     trust_remote_code=trust_remote_code,
   2174     **kwargs,
   2175 )

File ~/miniconda3/envs/swq/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:2397, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, token, cache_dir, local_files_only, _commit_hash, _is_local, trust_remote_code, *init_inputs, **kwargs)
   2395 # Instantiate the tokenizer.
   2396 try:
-> 2397     tokenizer = cls(*init_inputs, **init_kwargs)
   2398 except OSError:
   2399     raise OSError(
   2400         "Unable to load vocabulary from file. "
   2401         "Please check that the provided vocabulary is accessible and not corrupted."
   2402     )

File ~/miniconda3/envs/swq/lib/python3.11/site-packages/transformers/models/llama/tokenization_llama_fast.py:157, in LlamaTokenizerFast.__init__(self, vocab_file, tokenizer_file, clean_up_tokenization_spaces, unk_token, bos_token, eos_token, add_bos_token, add_eos_token, use_default_system_prompt, legacy, add_prefix_space, **kwargs)
    154 if add_prefix_space is not None:
    155     kwargs["from_slow"] = True
--> 157 super().__init__(
    158     vocab_file=vocab_file,
    159     tokenizer_file=tokenizer_file,
    160     clean_up_tokenization_spaces=clean_up_tokenization_spaces,
    161     unk_token=unk_token,
    162     bos_token=bos_token,
    163     eos_token=eos_token,
    164     add_bos_token=add_bos_token,
    165     add_eos_token=add_eos_token,
    166     use_default_system_prompt=use_default_system_prompt,
    167     add_prefix_space=add_prefix_space,
    168     legacy=legacy,
    169     **kwargs,
    170 )
    171 self._add_bos_token = add_bos_token
    172 self._add_eos_token = add_eos_token

File ~/miniconda3/envs/swq/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py:131, in PreTrainedTokenizerFast.__init__(self, *args, **kwargs)
    127         kwargs.update(additional_kwargs)
    129 elif self.slow_tokenizer_class is not None:
    130     # We need to create and convert a slow tokenizer to build the backend
--> 131     slow_tokenizer = self.slow_tokenizer_class(*args, **kwargs)
    132     fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
    133 else:

File ~/miniconda3/envs/swq/lib/python3.11/site-packages/transformers/models/llama/tokenization_llama.py:171, in LlamaTokenizer.__init__(self, vocab_file, unk_token, bos_token, eos_token, pad_token, sp_model_kwargs, add_bos_token, add_eos_token, clean_up_tokenization_spaces, use_default_system_prompt, spaces_between_special_tokens, legacy, add_prefix_space, **kwargs)
    169 self.add_eos_token = add_eos_token
    170 self.use_default_system_prompt = use_default_system_prompt
--> 171 self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
    172 self.add_prefix_space = add_prefix_space
    174 super().__init__(
    175     bos_token=bos_token,
    176     eos_token=eos_token,
   (...)
    187     **kwargs,
    188 )

File ~/miniconda3/envs/swq/lib/python3.11/site-packages/transformers/models/llama/tokenization_llama.py:198, in LlamaTokenizer.get_spm_processor(self, from_slow)
    196 tokenizer = spm.SentencePieceProcessor(**self.sp_model_kwargs)
    197 if self.legacy or from_slow:  # no dependency on protobuf
--> 198     tokenizer.Load(self.vocab_file)
    199     return tokenizer
    201 with open(self.vocab_file, "rb") as f:

File ~/miniconda3/envs/swq/lib/python3.11/site-packages/sentencepiece/__init__.py:961, in SentencePieceProcessor.Load(self, model_file, model_proto)
    959 if model_proto:
    960   return self.LoadFromSerializedProto(model_proto)
--> 961 return self.LoadFromFile(model_file)

File ~/miniconda3/envs/swq/lib/python3.11/site-packages/sentencepiece/__init__.py:316, in SentencePieceProcessor.LoadFromFile(self, arg)
    315 def LoadFromFile(self, arg):
--> 316     return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)

RuntimeError: Internal: could not parse ModelProto from ../../deq_models/test/tokenizer.model
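The traceback bottoms out in sentencepiece's own loader, so the saved tokenizer.model is evidently not a valid SentencePiece proto. A minimal way to reproduce just that last step, independent of transformers (same path as above, purely illustrative):

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
# Same call the traceback ends in; it fails because the file transformers
# saved is not a serialized SentencePiece ModelProto:
sp.Load("../../deq_models/test/tokenizer.model")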

Lin-xs commented on August 21, 2024

Sorry for the accidental action.

itazap commented on August 21, 2024

I found that this is caused by setting add_prefix_space=False in GGUFLlamaConverter. In turn, from_slow=True is then forced (from #28010). I checked loading from "meta-llama/Meta-Llama-3-8B", and I don't believe the add_prefix_space=False [#30391] is necessary: I checked the tokenization, and a prefix space is not added when it is set to None. I can push a fix to change it to add_prefix_space=None (and test!) unless @ArthurZucker sees an issue with this? The relevant code path is sketched below.
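For context, this is the chain that turns add_prefix_space=False into the ModelProto error, reconstructed from the traceback above (paraphrased excerpts, not the verbatim source):

# transformers/models/llama/tokenization_llama_fast.py
# Any non-None add_prefix_space forces a rebuild from the slow tokenizer:
if add_prefix_space is not None:
    kwargs["from_slow"] = True

# transformers/models/llama/tokenization_llama.py
# The slow path hands the saved tokenizer.model straight to SentencePiece,
# which fails because the file is not a SentencePiece ModelProto
# (Llama 3 uses a BPE tokenizer):
if self.legacy or from_slow:
    tokenizer.Load(self.vocab_file)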

@Lin-xs as a workaround for now, you can pass add_prefix_space=False as shown below to avoid the error!

AutoTokenizer.from_pretrained(model_id, gguf_file=filename, add_prefix_space=False)
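A round-trip sketch with that workaround applied (paths and filenames are taken from the repro above; passing the flag on the reload as well is an assumption, in case the saved config triggers the slow path again):

from transformers import AutoTokenizer

model_id = "QuantFactory/Meta-Llama-3-8B-GGUF"
filename = "Meta-Llama-3-8B.Q4_K_M.gguf"
save_dir = "../../deq_models/test"

# Load from the GGUF file with the workaround flag
tokenizer = AutoTokenizer.from_pretrained(
    model_id, gguf_file=filename, add_prefix_space=False
)
tokenizer.save_pretrained(save_dir)

# Assumption: pass the flag when reloading from disk as well
tokenizer2 = AutoTokenizer.from_pretrained(save_dir, add_prefix_space=False)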
