Comments (16)

sara-tagger commented on June 3, 2024

Thanks for the issue, @koaning will get back to you about it soon!

You may find help in the docs and the forum, too 🤗

koaning commented on June 3, 2024

I'm wondering what a good approach here might be. The BytePair feature that we offer also does tokenization, but only internally to the featurizer. That way we keep the original tokens intact, which is a requirement for our entity detection stack. For example:

My name is Vincent

This might get tokenized into

[My, name, is, Vin, cent]

As far as entity detection goes, though, we want to return Vincent, not [Vin, cent].

If we use this technique as a tokenizer we might see a benefit for intents, but it would break entities as well as some lexical features later on. It might instead be worth an experiment to see if we can use these tokens internally in a featurizer. But since the feature might become heavy, it would be nice to get some sort of confirmation that this idea has merit: that it improves a pipeline in a way that the other components can't. Have you done any work on this?
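
To make the alignment problem concrete, here is a minimal sketch. The subword split of Vincent into [Vin, cent] is hypothetical; the exact pieces depend on the trained model.

text = "My name is Vincent"

# Whitespace tokens keep the original word boundaries, so the gold entity span
# (11, 18) maps onto the single token "Vincent".
whitespace_tokens = [("My", 0, 2), ("name", 3, 7), ("is", 8, 10), ("Vincent", 11, 18)]

# A subword tokenizer may split "Vincent" into two pieces. If entity labels are
# assigned per token, the extractor returns "Vin" and "cent" instead of the
# full surface form "Vincent".
subword_tokens = [("My", 0, 2), ("name", 3, 7), ("is", 8, 10), ("Vin", 11, 14), ("cent", 14, 18)]

entity_span = (11, 18)  # character span of the annotated entity "Vincent"

def covered_tokens(tokens, span):
    start, end = span
    return [tok for tok, s, e in tokens if s >= start and e <= end]

print(covered_tokens(whitespace_tokens, entity_span))  # ['Vincent']
print(covered_tokens(subword_tokens, entity_span))     # ['Vin', 'cent']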

koaning commented on June 3, 2024

After talking with a colleague about this we wondered: have you ever worked with the ConveRT tokeniser? For English it should already tokenize into subtokens, and it can still be used by countvectors/DIET to generate internal representations.

anuragshas commented on June 3, 2024

I am actually working on Indian languages and the lookup table doesn't seem to work. I have tried both the WhiteSpace and Stanza tokenisers; that's why I wanted a custom tokeniser pretrained on my own data. I didn't find a way to train a PolyAI model (ConveRT), and there is no mention of other languages.

koaning commented on June 3, 2024

Which Indian language specifically? I'm also looking at this library.

anuragshas commented on June 3, 2024

I am working on Hindi right now and will expand to Tamil, Telugu and Kannada.

koaning commented on June 3, 2024

Have you tried any non-SentencePiece tokenizers for those languages? I've googled and found a few, but since I don't speak the languages I can't judge their quality. Have you seen this package or this one?

anuragshas commented on June 3, 2024

These are trivial tokenizers which do word and sentence tokenization, so it won't be much different from whitespace tokenization. The purna viram and deerga viram are the characters that differ from English, but they are used for sentence boundaries.

koaning commented on June 3, 2024

I'm open to the sentencepiece tokenizer as an experimental feature, but we will need to keep in mind that the scope is just to generate these tokens for the intents for now. I fear that it is going to be very tricky to get this to work for entities, but I'm interested in the experiment.

I've got no experience with SentencePiece, so just to check: @anuragshas, are these models available pretrained as well? We might need to think about a general corpus for different languages.

anuragshas commented on June 3, 2024

I have tried something similar to ConveRT; it improved the entity F1 by 7 points, but the entity predicted was the subword and not the exact word, as you had said earlier, even though there is the code for alignment:

train_utils.align_tokens(split_token_strings, token_end, token_start)

With WhitespaceTokenizer tokens it doesn't seem to work:

import os
from typing import Any, Dict, List, Text

import sentencepiece as spm

from rasa.nlu.tokenizers.tokenizer import Token
from rasa.nlu.tokenizers.whitespace_tokenizer import WhitespaceTokenizer
from rasa.nlu.training_data import Message
import rasa.utils.train_utils as train_utils


class SentencePieceTokenizer(WhitespaceTokenizer):

    defaults = {
        # Flag to check whether to split intents
        "intent_tokenization_flag": False,
        # Symbol on which intent should be split
        "intent_split_symbol": "_",
        # Text will be tokenized with case sensitive as default
        "case_sensitive": True,
        # specifies the path to a custom SentencePiece model file
        "model_file": None,
    }

    def __init__(self, component_config: Dict[Text, Any] = None) -> None:
        """Construct a new tokenizer using the SentencePiece framework."""

        super().__init__(component_config)

        model_file = self.component_config["model_file"]
        if model_file:
            if not os.path.exists(model_file):
                raise FileNotFoundError(
                    f"SentencePiece model {model_file} not found. Please check config."
                )
        self.model = spm.SentencePieceProcessor(model_file=model_file)

    def _tokenize(self, sentence: Text) -> Any:

        return self.model.encode(sentence, out_type=str)
    
    def tokenize(self, message: Message, attribute: Text) -> List[Token]:
        """Tokenize the text using the SentencePiece model.
        SentencePiece adds a special char in front of (some) words and splits words into
        sub-words. To ensure the entity start and end values match the token values,
        tokenize the text first using the whitespace tokenizer. If individual tokens
        are split up into multiple tokens, add this information to the
        respective tokens.
        """

        # perform whitespace tokenization
        tokens_in = super().tokenize(message, attribute)

        tokens_out = []

        for token in tokens_in:
            token_start, token_end, token_text = token.start, token.end, token.text
            # use SentencePiece model to tokenize the text
            split_token_strings = self._tokenize(token_text)

            # clean tokens (remove special chars and empty tokens)
            split_token_strings = self._clean_tokens(split_token_strings)

            tokens_out += train_utils.align_tokens(
                split_token_strings, token_end, token_start
            )

        return tokens_out

    @staticmethod
    def _clean_tokens(tokens: List[Text]) -> List[Text]:
        """Remove the special char added by SentencePiece and drop empty tokens."""

        # SentencePiece prefixes word-initial pieces with "▁" (U+2581),
        # which is not an ASCII underscore.
        tokens = [string.replace("\u2581", "") for string in tokens]
        return [string for string in tokens if string]

BPEmb for English has pretrained SentencePiece models trained on Wikipedia with different vocabulary sizes. The 10000-vocab model would be good to test with.
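
As a hedged sketch, one of those pretrained models can be loaded directly with the sentencepiece library; the file name below follows BPEmb's naming scheme and the local path is an assumption, and since BPEmb models are trained on lowercased Wikipedia text the input is lowercased before encoding.

import sentencepiece as spm

# Path is a placeholder; BPEmb distributes files named like en.wiki.bpe.vs10000.model
sp = spm.SentencePieceProcessor(model_file="en.wiki.bpe.vs10000.model")

# BPEmb models are trained on lowercased text, so lowercase the input first
pieces = sp.encode("My name is Vincent".lower(), out_type=str)
print(pieces)  # subword pieces prefixed with "▁", e.g. ['▁my', '▁name', ...]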

koaning commented on June 3, 2024

Interesting!

I have tried something similar to ConveRT; it improved the entity F1 by 7 points, but the entity predicted was the subword and not the exact word, as you had said earlier, even though there is the code for alignment.

Could you share some details on your config.yml file? Was it just using countvectors from the tokens and DIET? Also, what dataset was used?

BPEmb for English has pretrained SentencePiece models trained on Wikipedia with different vocabulary sizes. The 10000-vocab model would be good to test with.

I didn't know, but I just checked and it indeed seems to depend on the same library.

Also, a quick question about the model_file that you're using here: is that the same model file as the one from BPEmb, or are you training your own?

anuragshas commented on June 3, 2024

Here is my config.yml file:

pipeline:
- name: rasa_nlu_examples.tokenizers.SentencePieceTokenizer
  lang: "hi"
  model_file: "w2v_models/hi.xyz.bpe.vs10000.model"
- name: RegexFeaturizer
- name: CountVectorsFeaturizer
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 15
- name: rasa_nlu_examples.featurizers.dense.BytePairFeaturizer
  lang: hi
  vs: 10000
  dim: 100
  model_file: "w2v_models/hi.xyz.bpe.vs10000.model"
  emb_file: "w2v_models/hi.xyz.bpe.vs10000.d100.w2v.bin"
- name: DIETClassifier
  epochs: 200
- name: EntitySynonymMapper
language: "hi"
policies:
- name: TEDPolicy
  epochs: 1
  max_history: 3
  batch_size:
  - 32
  - 64
- name: MappingPolicy
- name: AugmentedMemoizationPolicy
- name: TwoStageFallbackPolicy
  nlu_threshold: 0.3
  core_threshold: 0.3
  fallback_core_action_name: "action_default_fallback"
  fallback_nlu_action_name: "action_default_fallback"
  deny_suggestion_intent_name: "out_of_scope"

Also, what dataset was used?

The dataset is translated text in Hindi in the pharma domain. I am not allowed to share it publicly.

I didn't know, but I just checked and it indeed seems to depend on the same library.

It depends only on the sentencepiece library, which is also a dependency of BPEmb. The *.model and *.vocab files belong to the SentencePiece model; the *.bin file is a gensim KeyedVectors file which contains the GloVe-trained word vectors.

Also, a quick question about the model_file that you're using here: is that the same model file as the one from BPEmb, or are you training your own?

I had trained my own model using something similar to this.
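
For reference, a minimal sketch of training a custom model with the sentencepiece library, along those lines; the corpus path, model prefix and vocabulary size are placeholders, and the resulting .model file is what the model_file option above would point to.

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus_hi.txt",          # plain text corpus, one sentence per line
    model_prefix="hi_bpe_vs10000",  # writes hi_bpe_vs10000.model and .vocab
    vocab_size=10000,
    model_type="bpe",               # BPEmb-style byte-pair encoding
    character_coverage=1.0,         # keep the full Devanagari character set
)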

koaning commented on June 3, 2024

Again, interesting! The dataset that you trained on is a more general dataset than just your Rasa corpus? Also, did it work better than the standard BytePair embeddings?

You also mentioned a 7 point increase. Can you share anything about the size of your dataset? How many intents/entities/examples? Anything you can share about the domain?

koaning commented on June 3, 2024

One thing I wonder: could you use the WhiteSpaceTokenizer with your current rasa_nlu_examples.featurizers.dense.BytePairFeaturizer models? The intent performance might remain the same. Looking at the current implementation, we just take the full text, not the tokens separately, to detect the intent.

It might be good to check. If the performance doesn't change too much then we might focus on writing a tool that makes it easier to train your own byte-pair embeddings for Rasa.

koaning commented on June 3, 2024

Just a heads up. I can't make any promises on when it will be done. But I am now working on this.

koaning commented on June 3, 2024

This feature has been taken care of, at least partially, by our language model featurizer.
