rasahq / rasa-nlu-examples

This repository contains examples of custom components for educational purposes.

Home Page: https://RasaHQ.github.io/rasa-nlu-examples/

License: Apache License 2.0

Python 99.62% Makefile 0.38%

rasa rasa-nlu

rasa-nlu-examples's Issues

Standardise Ways to Share Results

We'd love it if people could let us know which tools help them. With that in mind, it would be good to have a page in the documentation where we can list results. It would probably be best to separate the results per language and per project.

This thread is to collect ideas on this topic.

Non-English Namelists

This is a response to this question and this one. It seems clear that a pre-trained French model can't be expected to detect names that aren't from France very well. The questions on the forum are about French, but I can certainly imagine this issue being relevant for other languages too.

So, as a next-best idea, we should host common names per country somewhere so folks can use them as a lookup for a regex. This needs to be a community effort, but it could be very helpful.
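
To make the lookup idea concrete, here is a minimal sketch of turning a name list into a single regex. The names, the function names and the sample sentence are all made up for illustration; a real list would come from whatever file the community ends up hosting.

```python
import re

# Hypothetical sample; a real list would come from a community-maintained file.
NAMES = ["Amélie", "Jean-Luc", "Nguyen", "Oluwaseun"]

def build_name_pattern(names):
    """Compile one regex that matches any name in the list as a whole word."""
    # Longest names first so that longer alternatives win over their prefixes.
    ordered = sorted(set(names), key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    return re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)", re.IGNORECASE)

def find_names(text, pattern):
    """Return (start, end, matched_text) for every name found in text."""
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]

pattern = build_name_pattern(NAMES)
matches = find_names("Bonjour, je m'appelle jean-luc et voici Nguyen.", pattern)
```

The lookarounds keep a name from matching inside a longer word, and case-insensitive matching is a choice that may or may not suit every language.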

Indic NLP

Something to keep an eye on:

https://anoopkunchukuttan.github.io/indic_nlp_library/

If we receive feedback that this tool could be useful to add here, we should.

snowballstemmer

I've just been told it might work well for Turkish.

https://pypi.org/project/snowballstemmer/

Benchmarking Results

We want to add a section to the documentation where people can share results of the tools in this library. We'll gladly link to any blog post or GitHub project as well, but we'd love to hear about, and list, the scenarios in which our tools make a difference.

Anybody with results is free to ping @koaning here.

Support 2.0

Are you going to adapt the examples to version 2.0?

Add Sentencepiece Tokeniser support

SentencePiece is generally used to create byte pairs in any language, and as far as I can tell there is no built-in support for this kind of tokenisation in Rasa. This library does use BPEmb, but only for pretrained embeddings, not for tokenisation. Since the whitespace tokeniser doesn't always perform well, I would like to have support for it. I am willing to open a PR for this, but I don't know the contribution steps here.

spaCy and numerical modifiers

Might be good to map numeric modifiers to entities.

https://forum.rasa.com/t/how-to-map-multiple-items-and-their-quantities-rasa-x/38571/11

deploy rasa-nlu-examples with docker

Hi everyone,
I trained a new NLU-only model with my own w2v model. Now I want to put it into a production/staging environment,
but the official Rasa Docker image does not support this project. Could you show me how to create a Dockerfile based on the official Rasa Docker image so that I can use my own w2v model?

missing GensimFeaturizer module after pip install git+

I put GensimFeaturizer in the Rasa NLU pipeline and got this message:

Exception: Failed to find class 'GensimFeaturizer' in module 'rasa_nlu_examples.featurizers.dense'.

I tried uninstalling and reinstalling with pip git and python pip git, cloning the master branch at the latest commit, and so on, but the problem was not solved.
Finally, I found that the class's .py file was not present, so I copied gensim_featurizer.py into the installed package folder (/rasa_nlu_example/featurizer/dense) and it worked!

Experiment with `padatious`

There was a package suggested here that might be worth exploring. It is suggested that it works really well on very small datasets. The project can be found here.

Printer should use Rich

The printer should print pretty objects. The rich library might be able to make a difference. It might make the output a lot clearer.

Before an implementation happens a path forward should be discussed here. Are we going to use tables? Colors?

Add LogisticRegression

I'd like to add another intent classification model here. Similar to the naive bayes model.

BlankSpacyTokenizer

spaCy has lots of blank tokenizers. They are rule-based, and therefore they should also technically have features that our simple whitespace tokenizer doesn't.

Feature Request: spaCy POS tags

A lot of entities will be nouns. Even if we don't use spaCy as an entity detection engine, it does come with linguistic features, such as part-of-speech tags, that might be useful for our pipeline. Would be worth an experiment.

NLU stopwords

The Rasa CountVectorsFeaturizer currently has a stopwords attribute, but it unfortunately only affects the sparse features. Word embeddings for stopwords are still generated. It might make sense to build a component that actually removes the stopwords from the message before it is handled by other components.
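
As a rough illustration of the filtering step only, here is a sketch in plain Python. It is not a Rasa component: the stopword set and the function name are hypothetical, and it assumes whitespace-separated text.

```python
# Hypothetical stopword list; a real component would load one per language.
STOPWORDS = {"the", "a", "an", "is", "to"}

def remove_stopwords(text, stopwords=STOPWORDS):
    """Drop stopwords from whitespace-separated text, ignoring case."""
    kept = [word for word in text.split() if word.lower() not in stopwords]
    return " ".join(kept)

cleaned = remove_stopwords("I want to book a flight to the airport")
```

A real component would have to run before the tokenizer and the featurizers, so that neither sparse nor dense features are computed for the removed words.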

Speed Up Tests

Currently we're using the CLI runner to run smoke tests. This is a good idea, but it is incredibly slow: it has to shut down and reload TensorFlow on every pass, and this is slowing down the CI on GitHub.

It's probably much better to drop the CLI and run the command that it triggers directly from within Python.

FlashTextEntityExtractor

Regex can be slow. Instead, it might help to have an entity extractor that is based on flashtext. This can really help with long name lists, for example.
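
To show why a dictionary-based approach can scale better than one large regex, here is a minimal sketch of the underlying idea. It deliberately does not use the flashtext API; it is a plain-Python stand-in that looks up word sequences in a dict, and all names in it are made up.

```python
import re

def build_index(keywords):
    """Map each lower-cased keyword, split into words, to its original form."""
    return {tuple(k.lower().split()): k for k in keywords}

def extract_keywords(text, index):
    """Walk over the words once, preferring the longest match at each position."""
    tokens = [(m.group(0).lower(), m.start(), m.end())
              for m in re.finditer(r"\w+", text)]
    max_len = max((len(key) for key in index), default=0)
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tok for tok, _, _ in tokens[i:i + n])
            if key in index:
                # Report the keyword with its character span in the text.
                found.append((index[key], tokens[i][1], tokens[i + n - 1][2]))
                i += n
                break
        else:
            i += 1  # no keyword starts at this word
    return found

index = build_index(["New York", "York", "Amsterdam"])
entities = extract_keywords("Flights from new york to Amsterdam", index)
```

Each lookup is a hash lookup, so the work per word depends on the longest keyword rather than on how many keywords there are. Keywords containing punctuation such as hyphens would need extra handling in this sketch.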

Printer object needs to be configurable.

I've noticed that we might only be interested in printing information about certain properties in the pipeline so far. It'd be nice if we could configure the component to allow for that.
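
One possible shape for this, sketched under the assumption that the printer sees the message as a plain dict of properties. The `keys` setting and the function name are hypothetical, not an existing option.

```python
def select_properties(message_data, keys=None):
    """Return only the requested properties; all of them when keys is None."""
    if keys is None:
        return dict(message_data)
    return {k: v for k, v in message_data.items() if k in keys}

# Hypothetical message contents, standing in for what a Printer would see.
message_data = {"text": "hello there", "intent": "greet", "entities": []}
print(select_properties(message_data, keys=["text", "intent"]))
```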

Doc2Vec instead of Word2Vec model for Gensim featurizer?

Wondering if a doc2vec model, instead of a word2vec model, can be trained and used with the Gensim featurizer?

Also, if HFTransformer is chosen as the language model, so that the corresponding tokenizer and featurizer are used in the NLU pipeline, can the Gensim featurizer still be added to the pipeline to improve domain-specific processing?

Thanks.

Dependency Management

It might be better to make it the users' responsibility to handle the dependencies and not have them install immediately via pip. I don't think we can use the [all] syntax when we're installing from GitHub, but we could explore that.

In any case, we really don't want folks to download the thai-language tools or pytorch if they're only using the bytepair embeddings.
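
A minimal sketch of one way to get there: import the heavy package only when a component actually needs it, and fail with a message that says what to install. The class and function names are hypothetical, and the pip hint is only a guess at the right package name.

```python
import importlib

def optional_import(module_name, extra_hint):
    """Import a module on demand, with an actionable error if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as err:
        raise ModuleNotFoundError(
            f"'{module_name}' is required for this component. {extra_hint}"
        ) from err

class LazyFeaturizer:
    """Hypothetical featurizer that needs a heavy package only when used."""

    def __init__(self, module_name):
        self.module_name = module_name
        self._module = None  # nothing is imported at construction time

    def load(self):
        # Import on first use and cache the module for later calls.
        if self._module is None:
            self._module = optional_import(
                self.module_name, f"Try: pip install {self.module_name}"
            )
        return self._module
```

With this pattern, importing the package that holds the featurizers no longer fails just because one optional dependency is absent.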

Adding Stanza

In places where FastText wrapped into spaCy is of no use, Stanza comes in handy: it can give us the necessary POS tags and lemmatization. That is at least the case for Estonian. It should also hold for Finnish, Hebrew, Hindi, Hungarian, Indonesian, Irish, Korean, Latvian, Persian, Telugu, Urdu, etc.

Feature Request: Gensim Key-Value Pairs

Gensim probably offers the easiest way to train your own embeddings, which would allow users to use their own if they have a reliable corpus. I've understood that Wikipedia is not a reliable source of online slang for many languages.

add support for fasttext language detection

Fasttext has a tool for this. Might be nice to check if we can use it in a component.

Could it trigger a RulePolicy?

RasaClassifier is currently influenced by `ranking_length`.

As reported in RasaHQ/rasalit#37, it's something that needs fixing.

Fix printer intents

Issue described in detail here.

Confirm WhitespaceTokenizer for Arabic

I'm wondering if the issue here isn't propagated to our WhitespaceTokenizer. It'd be good to confirm it's not an issue.

fugashi

Japanese tools: https://www.aclweb.org/anthology/2020.nlposs-1.7/

fix docs link

Stanza currently points to Thai.

Investigate CLTK

I found this project on GitHub and it might offer interesting tools for Hindi, among other languages. I do not know if the tokenizers are of high quality, but they are documented here.

zemberek

From Tensorflow Turkey meetup. Let's investigate if we can add it here.

Turkish NLU data

I'd like to add Turkish NLU data if there isn't anyone else doing it at the moment.

Remove occurrences of `token_pattern` parameter from `CountVectorsFeaturizer` config

As per the discussion on Slack here (summary: we've decided against exposing token_pattern and will use analyzer instead), we should remove these occurrences and replace them with the appropriate analyzer argument (found in benchmarking.md and readme.md).

Unclear error message when file is not found

Currently, when the file is not found at the cache_dir specified in the config.yml file, the error message is very opaque:

ValueError: /path/to/file/wiki.es.bin cannot be opened for loading!

The problem was that the file wiki.es.bin does not exist at /path/to/file, though this is not at all obvious from the error message.
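
A check along these lines, run before the model is loaded, would make the cause obvious. This is only a sketch: the function name is made up, and it assumes the component knows the directory and the file name separately.

```python
from pathlib import Path

def resolve_model_path(cache_dir, file_name):
    """Return the model path, or raise an error that says what is missing."""
    directory = Path(cache_dir)
    path = directory / file_name
    if not directory.is_dir():
        raise FileNotFoundError(
            f"cache_dir '{directory}' does not exist or is not a directory."
        )
    if not path.is_file():
        raise FileNotFoundError(
            f"File '{file_name}' was not found in cache_dir '{directory}'. "
            "Check the cache_dir setting in config.yml and the file name."
        )
    return path
```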

read paper CAMEL

https://www.aclweb.org/anthology/2020.lrec-1.868/

CustomPythonComponent

Goal

To make hacking around easier for our research department, it might make sense to have a component that can just apply a function on a message.

The idea is that you, as a user, can define a file, say custom_component.py, like this:

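# custom_component.py
# Pseudocode: CustomPythonComponent is the proposed component and does not exist yet;
# load_fasttext() stands in for whatever loads your model.
from rasa_nlu_examples.meta import CustomPythonComponent

model = load_fasttext()

def fasttext(message, setting_a=1, setting_b=2):
    """Attach extra features to the message and return it."""
    model.process(message)
    return message  # this message now has extra features attached

# Wrap the function so that it can be referenced from config.yml.
MyFastTextTool = CustomPythonComponent(fasttext, setting_a=1, setting_b=2)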

Once such a file is around in your project, it'd be cool if you could do:

language: en

pipeline:
- name: WhitespaceTokenizer
- name: LexicalSyntacticFeaturizer
- name: rasa_nlu_examples.meta.Printer
  alias: before count vectors
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: rasa_nlu_examples.meta.Printer
  alias: after count vectors
- name: custom_component.MyFastTextTool
  setting_a: 1
  setting_b: 2
- name: DIETClassifier
  epochs: 100

This should make it much easier to add a featurizer. We shouldn't implement internal tools with it, but this might allow for some experimentation with actual Rasa tools as opposed to Jupyter notebooks.
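
For discussion, here is a minimal sketch of what such a wrapper could look like in plain Python. It is not wired into Rasa: the message is assumed to be a plain dict, and `add_length` is a toy stand-in for a real featurizer function.

```python
from functools import partial

class CustomPythonComponent:
    """Hypothetical wrapper: applies a plain function to each message."""

    def __init__(self, func, **settings):
        self.name = func.__name__
        # Bind the settings once so process() only needs the message.
        self.func = partial(func, **settings)

    def process(self, message):
        # The wrapped function returns the (possibly modified) message.
        return self.func(message)

def add_length(message, key="n_chars"):
    """Toy stand-in for a featurizer: attach the text length to the message."""
    message[key] = len(message["text"])
    return message

LengthTool = CustomPythonComponent(add_length, key="text_length")
result = LengthTool.process({"text": "hello"})
```

A real version would still need to subclass Rasa's component base class and read the settings from the pipeline config, which this sketch leaves out.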

Arabic Tokenizer

https://github.com/ARBML/tkseem

Tokenizers for Less Common Languages

The whitespace tokenizer in Rasa is focused on Western languages. If there are languages that would benefit from a different tokenizer, we can explore alternatives in this thread.

Github Workflow does not work with BytePairFeaturizer anymore, because FastTextFeaturizer can't be found

Hey there, I have been using a CI/CD pipeline on GitHub for a while that installs rasa-nlu-examples and then trains and tests the model.
It worked without any problems.
Today the workflow failed and I got this error:

ComponentNotFoundException: Failed to load the component 'rasa_nlu_examples.featurizers.dense.BytePairFeaturizer'. Failed to find module 'rasa_nlu_examples.featurizers.dense'. Either your pipeline configuration contains an error or the module you are trying to import is broken (e.g. the module is trying to import a package that is not installed). Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.7.9/x64/lib/python3.7/site-packages/rasa/nlu/registry.py", line 121, in get_component_class
    return rasa.shared.utils.common.class_from_module_path(component_name)
  File "/opt/hostedtoolcache/Python/3.7.9/x64/lib/python3.7/site-packages/rasa/shared/utils/common.py", line 20, in class_from_module_path
    m = importlib.import_module(module_name)
  File "/opt/hostedtoolcache/Python/3.7.9/x64/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/opt/hostedtoolcache/Python/3.7.9/x64/lib/python3.7/site-packages/rasa_nlu_examples/featurizers/dense/__init__.py", line 1, in <module>
    from .fasttext_featurizer import FastTextFeaturizer
  File "/opt/hostedtoolcache/Python/3.7.9/x64/lib/python3.7/site-packages/rasa_nlu_examples/featurizers/dense/fasttext_featurizer.py", line 5, in <module>
    import fasttext
ModuleNotFoundError: No module named 'fasttext'

Error: Process completed with exit code 1.

but found this error


(binus) Wellys-MacBook-Pro:rasa-demo wellytambunan$ rasa test nlu --config basic-bytepair.config.yml           --cross-validation --runs 1 --folds 2           --out gridresults/basic-bytepair-config
2020-08-14 10:22:31 INFO     rasa.cli.test  - Test model using cross validation.
Traceback (most recent call last):
  File "/Users/wellytambunan/opt/anaconda3/envs/binus/bin/rasa", line 8, in <module>
    sys.exit(main())
  File "/Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/rasa/__main__.py", line 92, in main
    cmdline_arguments.func(cmdline_arguments)
  File "/Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/rasa/cli/test.py", line 147, in run_nlu_test
    perform_nlu_cross_validation(config, nlu_data, output, vars(args))
  File "/Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/rasa/test.py", line 243, in perform_nlu_cross_validation
    data, folds, nlu_config, output, **kwargs
  File "/Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/rasa/nlu/test.py", line 1354, in cross_validate
    trainer = Trainer(nlu_config)
  File "/Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/rasa/nlu/model.py", line 142, in __init__
    components.validate_requirements(cfg.component_names)
  File "/Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/rasa/nlu/components.py", line 46, in validate_requirements
    from rasa.nlu import registry
  File "/Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/rasa/nlu/registry.py", line 13, in <module>
    from rasa.nlu.classifiers.diet_classifier import DIETClassifier
  File "/Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/rasa/nlu/classifiers/diet_classifier.py", line 9, in <module>
    import tensorflow_addons as tfa
  File "/Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/tensorflow_addons/__init__.py", line 21, in <module>
    from tensorflow_addons import activations
  File "/Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/tensorflow_addons/activations/__init__.py", line 21, in <module>
    from tensorflow_addons.activations.gelu import gelu
  File "/Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/tensorflow_addons/activations/gelu.py", line 24, in <module>
    get_path_to_datafile("custom_ops/activations/_activation_ops.so"))
  File "/Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/tensorflow/python/framework/load_library.py", line 58, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
**tensorflow.python.framework.errors_impl.NotFoundError:** dlopen(/Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/tensorflow_addons/custom_ops/activations/_activation_ops.so, 6): **Symbol not found:** __ZN10tensorflow11GetNodeAttrERKNS_9AttrSliceEN4absl11string_viewEPb
  Referenced from: /Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/tensorflow_addons/custom_ops/activations/_activation_ops.so
  Expected in: /Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.2.dylib
 in /Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/tensorflow_addons/custom_ops/activations/_activation_ops.so
(binus) Wellys-MacBook-Pro:rasa-demo wellytambunan$

is_oov: is the token part of the vocabulary/does it have a vector?
is_stop: is the token a stopword?
lemma_: what is the lemma of the token
pos/tag: coarse/fine-grained part-of-speech information
morphological features
grammatical dependency

These can all have a discrete representation and could be added in general to a Rasa pipeline.
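
As a small illustration of what a discrete representation could look like, here is a sketch that one-hot encodes a part-of-speech tag and appends the stopword flag. The tag set and the token dicts are made up; real values would come from a tagger such as spaCy.

```python
def one_hot(value, vocabulary):
    """Encode a categorical value as a list with a single 1."""
    return [1 if value == item else 0 for item in vocabulary]

# Hypothetical tag set and tokens; real values would come from a tagger.
POS_TAGS = ["NOUN", "VERB", "ADJ", "OTHER"]
tokens = [
    {"text": "book", "pos": "VERB", "is_stop": False},
    {"text": "a", "pos": "OTHER", "is_stop": True},
    {"text": "flight", "pos": "NOUN", "is_stop": False},
]

# One feature row per token: four POS slots followed by the stopword flag.
features = [one_hot(t["pos"], POS_TAGS) + [int(t["is_stop"])] for t in tokens]
```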

Missing `prepare_everything.py`

The make install command ends with

python tests/prepare_everything.py
python: can't open file 'tests/prepare_everything.py': [Errno 2] No such file or directory
make: *** [Makefile:4: install] Error 2

Also, the documentation refers to this file, but it doesn't exist.

klpt

Kurdish: https://www.aclweb.org/anthology/2020.nlposs-1.11.pdf
