Thanks for the issue, @koaning will get back to you about it soon!
You may find help in the docs and the forum, too 🤗
from rasa-nlu-examples.
I'm wondering what a good approach here might be. The BytePair feature that we offer also does tokenization, but only internally to the featurizer. That way we keep the original tokens intact, which is a requirement for our entity detection stack. For example:
My name is Vincent
This might get tokenized into
[My, name, is, Vin, cent]
As far as entity detection goes, though, we want to return Vincent, not [Vin, cent].
If we use this technique as a tokenizer we might see a benefit for intents, but it would break entities as well as some lexical features later on. Instead, it might be worth an experiment to see if we can use these tokens internally in a featurizer. But since the feature might become heavy, it would be nice to get some confirmation that the idea has merit: that it improves a pipeline in a way that the other components can't. Have you done any work on this?
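To make the "featurizer-internal" idea concrete, here is a minimal sketch in plain Python. It keeps the whitespace tokens (with their character offsets) as the units that entity extraction sees, and only attaches subword pieces to each token on the side. The `TOY_SPLITS` table and both function names are hypothetical and exist only to reproduce the example above; a real setup would call a trained subword model instead.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, int, int]  # (text, start, end)

# Hypothetical subword splits, chosen only to reproduce the example above.
TOY_SPLITS: Dict[str, List[str]] = {"Vincent": ["Vin", "cent"]}


def whitespace_tokens(text: str) -> List[Token]:
    """Return (token, start, end) triples for whitespace-separated words."""
    tokens = []
    position = 0
    for word in text.split():
        start = text.index(word, position)
        end = start + len(word)
        tokens.append((word, start, end))
        position = end
    return tokens


def subwords_per_token(text: str) -> List[Tuple[Token, List[str]]]:
    """Pair each original token with the pieces a featurizer would embed."""
    return [
        (token, TOY_SPLITS.get(token[0], [token[0]]))
        for token in whitespace_tokens(text)
    ]


pairs = subwords_per_token("My name is Vincent")
for (word, start, end), pieces in pairs:
    print(f"{word!r} [{start}:{end}] -> {pieces}")
```

The entity stack still receives `Vincent` with offsets 11 to 18, while the pieces `Vin` and `cent` are available for feature lookup only.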
from rasa-nlu-examples.
After talking with a colleague about this we wondered: have you ever worked with the ConveRT tokenizer? For English it should already tokenize into subtokens, and it can still be used by countvectors/DIET to generate internal representations.
from rasa-nlu-examples.
I am actually working on Indian languages, and lookup tables don't seem to work. I have tried both the Whitespace and stanza tokenizers, which is why I wanted a custom tokenizer pretrained on my own data. I didn't find a way to train a PolyAI model (ConveRT), and there is no mention of other languages.
from rasa-nlu-examples.
Which Indian language specifically? I'm also looking at this library.
from rasa-nlu-examples.
I am working on Hindi right now, will expand to Tamil, Telugu and Kannada
from rasa-nlu-examples.
Have you tried any non-Sentencepiece tokenizers for those languages? I've googled and found a few but since I don't speak the languages I can't judge their quality. Have you seen this package or this one?
from rasa-nlu-examples.
These are trivial tokenizers that do word and sentence tokenization. They won't be much different from whitespace tokenization. Purna viram and deergha viram are the marks that differ from English, but they are only used for sentence boundaries.
from rasa-nlu-examples.
I'm open to the sentencepiece tokenizer as an experimental feature but we will need to keep in mind that the scope is just to generate these tokens for the intents for now. I fear that it is going to be very tricky to get this to work for entities but I'm interested in the experiment.
I've got no experience with SentencePiece so just to check. @anuragshas are these models available pretrained as well? We might need to think about a general corpus for different languages.
from rasa-nlu-examples.
I have tried something similar to ConveRT. It improved the entity F1 by 7 points, but the predicted entity was a subword and not the exact word, as you said earlier. Even though the code calls train_utils.align_tokens(split_token_strings, token_end, token_start) to align with the WhitespaceTokenizer tokens, it doesn't seem to work.
import os
from typing import Any, Dict, List, Text

import sentencepiece as spm

from rasa.nlu.tokenizers.tokenizer import Token
from rasa.nlu.tokenizers.whitespace_tokenizer import WhitespaceTokenizer
from rasa.nlu.training_data import Message
import rasa.utils.train_utils as train_utils


class SentencePieceTokenizer(WhitespaceTokenizer):

    defaults = {
        # Flag to check whether to split intents
        "intent_tokenization_flag": False,
        # Symbol on which intent should be split
        "intent_split_symbol": "_",
        # Text will be tokenized with case sensitive as default
        "case_sensitive": True,
        # specifies the path to a custom SentencePiece model file
        "model_file": None,
    }

    def __init__(self, component_config: Dict[Text, Any] = None) -> None:
        """Construct a new tokenizer using the SentencePiece framework."""
        super().__init__(component_config)

        # stays None when no model_file is configured
        self.model = None
        model_file = self.component_config["model_file"]
        if model_file:
            if not os.path.exists(model_file):
                raise FileNotFoundError(
                    f"SentencePiece model {model_file} not found. Please check config."
                )
            self.model = spm.SentencePieceProcessor(model_file=model_file)

    def _tokenize(self, sentence: Text) -> Any:
        return self.model.encode(sentence, out_type=str)

    def tokenize(self, message: Message, attribute: Text) -> List[Token]:
        """Tokenize the text using the SentencePiece model.

        SentencePiece adds a special char in front of (some) words and splits words into
        sub-words. To ensure the entity start and end values match the token values,
        tokenize the text first using the whitespace tokenizer. If individual tokens
        are split up into multiple tokens, add this information to the
        respective tokens.
        """
        # perform whitespace tokenization
        tokens_in = super().tokenize(message, attribute)

        tokens_out = []
        for token in tokens_in:
            token_start, token_end, token_text = token.start, token.end, token.text
            # use SentencePiece model to tokenize the text
            split_token_strings = self._tokenize(token_text)
            # clean tokens (remove special chars and empty tokens)
            split_token_strings = self._clean_tokens(split_token_strings)
            tokens_out += train_utils.align_tokens(
                split_token_strings, token_end, token_start
            )

        return tokens_out

    @staticmethod
    def _clean_tokens(tokens: List[Text]) -> List[Text]:
        """Remove the special char added by SentencePiece and drop empty tokens."""
        # SentencePiece marks word starts with U+2581 ("▁"), not an ASCII underscore
        tokens = [string.replace("▁", "") for string in tokens]
        return [string for string in tokens if string]
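For readers following the alignment problem, the sketch below shows one way the offsets could be assigned and how a predicted subword span could be widened back to the whole word. It is plain Python and does not use Rasa; `align_pieces` and `expand_to_word` are hypothetical names for this illustration and are not the implementation of `train_utils.align_tokens`. It assumes the cleaned pieces concatenate back to the original word.

```python
from typing import List, Tuple

Span = Tuple[str, int, int]  # (text, start, end)


def align_pieces(pieces: List[str], word_start: int, word_end: int) -> List[Span]:
    """Assign character offsets to the subword pieces of one whitespace token.

    The last piece is clamped to word_end so offsets never run past the word.
    """
    spans = []
    offset = word_start
    for index, piece in enumerate(pieces):
        is_last = index == len(pieces) - 1
        end = word_end if is_last else offset + len(piece)
        spans.append((piece, offset, end))
        offset = end
    return spans


def expand_to_word(start: int, end: int, words: List[Span]) -> Tuple[int, int]:
    """Widen a predicted span so it covers every word it overlaps."""
    overlapping = [w for w in words if w[1] < end and w[2] > start]
    if not overlapping:
        return start, end
    return min(w[1] for w in overlapping), max(w[2] for w in overlapping)


text = "My name is Vincent"
words = [("My", 0, 2), ("name", 3, 7), ("is", 8, 10), ("Vincent", 11, 18)]
pieces = align_pieces(["Vin", "cent"], 11, 18)

# Suppose a model tagged only the first piece ("Vin") as an entity.
entity_start, entity_end = expand_to_word(pieces[0][1], pieces[0][2], words)
print(pieces)
print(text[entity_start:entity_end])
```

Widening in a post-processing step like this is only one option. It trades the occasional over-wide span for never returning half a word.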
BPEmb for English has pretrained SentencePiece models trained on Wikipedia, with different vocabulary sizes. The 10,000-vocab model would be good to test with.
from rasa-nlu-examples.
Interesting!
> I have tried something similar to ConveRT, it improved the entity f1 by 7 points but the entity predicted was subword and not the exact word as you had said earlier, even though there is the code for alignment
Could you share some details on your config.yml file? Was it just using countvectors from the tokens and DIET? Also, what dataset was used?
> BPEmb for English has the pretrained SentencePiece model on wikipedia with different vocab capacity. 10000 vocab model would be good to test with.
I didn't know, but I just checked and it indeed seems to depend on the same library.
Also a quick question about the model_file that you're using here: is that the same model file as the one from BPEmb, or are you training your own?
from rasa-nlu-examples.
Here is my config.yml file:
pipeline:
  - name: rasa_nlu_examples.tokenizers.SentencePieceTokenizer
    lang: "hi"
    model_file: "w2v_models/hi.xyz.bpe.vs10000.model"
  - name: RegexFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 15
  - name: rasa_nlu_examples.featurizers.dense.BytePairFeaturizer
    lang: hi
    vs: 10000
    dim: 100
    model_file: "w2v_models/hi.xyz.bpe.vs10000.model"
    emb_file: "w2v_models/hi.xyz.bpe.vs10000.d100.w2v.bin"
  - name: DIETClassifier
    epochs: 200
  - name: EntitySynonymMapper
language: "hi"
policies:
  - name: TEDPolicy
    epochs: 1
    max_history: 3
    batch_size:
      - 32
      - 64
  - name: MappingPolicy
  - name: AugmentedMemoizationPolicy
  - name: TwoStageFallbackPolicy
    nlu_threshold: 0.3
    core_threshold: 0.3
    fallback_core_action_name: "action_default_fallback"
    fallback_nlu_action_name: "action_default_fallback"
    deny_suggestion_intent_name: "out_of_scope"
> Also, what dataset was used?
The dataset is translated Hindi text in the pharma domain. I am not allowed to share it publicly.
> I didn't know but I indeed just checked, it seems to depend on the same library
It depends only on the sentencepiece library, which is also a dependency of BPEmb. The *.model and *.vocab files belong to the SentencePiece model; the *.bin file is a gensim KeyedVectors file containing GloVe-trained embeddings in word2vec format.
> Also a quick question, the model_file that you're using here. Is that the same model file as from BPEmb or are you training your own?
I had trained my own model using something similar to this
from rasa-nlu-examples.
Again, interesting! The dataset that you trained on is a more general dataset than just your Rasa corpus? Also, did it work better than the standard BytePair embeddings?
You also mentioned a 7 point increase. Can you share anything about the size of your dataset? How many intents/entities/examples? Anything you can share about the domain?
from rasa-nlu-examples.
One thing I wonder: could you use the WhitespaceTokenizer with your current rasa_nlu_examples.featurizers.dense.BytePairFeaturizer models? The intent performance might remain the same. Looking at the current implementation, we just take the full text, not the tokens separately, to detect the intent.
It might be good to check. If the performance doesn't change too much then we might focus on writing a tool that makes it easier to train your own byte-pair embeddings for Rasa.
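As a rough illustration of why the tokenizer choice might not matter for intents, the toy sketch below builds a sentence-level vector directly from the full text. It assumes mean pooling over subword embeddings, which is an assumption made for this example and not something confirmed in this thread. `TOY_SPLITS`, `TOY_EMBEDDINGS` and both function names are hypothetical stand-ins for a trained byte-pair model.

```python
from typing import Dict, List

# Hypothetical subword splits and 2-dimensional embeddings, for illustration only.
TOY_SPLITS: Dict[str, List[str]] = {"vincent": ["vin", "cent"]}
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "my": [1.0, 0.0],
    "name": [0.0, 1.0],
    "is": [1.0, 1.0],
    "vin": [2.0, 0.0],
    "cent": [0.0, 2.0],
}


def subword_pieces(text: str) -> List[str]:
    """Split the full text into subword pieces, independent of any tokenizer."""
    pieces: List[str] = []
    for word in text.lower().split():
        pieces.extend(TOY_SPLITS.get(word, [word]))
    return pieces


def sentence_vector(text: str) -> List[float]:
    """Mean-pool the embeddings of all known pieces in the full text."""
    vectors = [TOY_EMBEDDINGS[p] for p in subword_pieces(text) if p in TOY_EMBEDDINGS]
    if not vectors:
        return [0.0, 0.0]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


vector = sentence_vector("My name is Vincent")
print(vector)
```

No token list from the pipeline enters `sentence_vector`, so under these assumptions swapping the tokenizer would leave the sentence-level feature unchanged.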
from rasa-nlu-examples.
Just a heads up. I can't make any promises on when it will be done. But I am now working on this.
from rasa-nlu-examples.
This feature has been taken care of, at least partially, by our language model featurizer.
from rasa-nlu-examples.