
This repository contains examples of custom components for educational purposes.

Home Page: https://RasaHQ.github.io/rasa-nlu-examples/

License: Apache License 2.0



Rasa NLU Examples

This repository contains Rasa-compatible machine learning components. These components are open-sourced in order to encourage experimentation and to quickly offer support for more tools. By hosting these components here, they do not need to go through the same vetting process as the components in Rasa, and we hope this makes it easier for people to contribute new ideas.

The components in this repository are not officially supported. There will be unit tests as well as documentation, but this project should be considered a community project, not something that is part of core Rasa. If a component here turns out to be useful to the larger Rasa community, we might port features from this repository into Rasa.

Install

To use these tools locally, you need to install them via git:

python -m pip install "rasa_nlu_examples @ git+https://github.com/RasaHQ/rasa-nlu-examples.git"

Note that if you want to install optional dependencies as well, you'll need to run one of:

python -m pip install "rasa_nlu_examples[flashtext] @ git+https://github.com/RasaHQ/rasa-nlu-examples.git"
python -m pip install "rasa_nlu_examples[dateparser] @ git+https://github.com/RasaHQ/rasa-nlu-examples.git"
python -m pip install "rasa_nlu_examples[thai] @ git+https://github.com/RasaHQ/rasa-nlu-examples.git"
python -m pip install "rasa_nlu_examples[fasttext] @ git+https://github.com/RasaHQ/rasa-nlu-examples.git"
python -m pip install "rasa_nlu_examples[all] @ git+https://github.com/RasaHQ/rasa-nlu-examples.git"

If you're using any models that depend on spaCy you'll need to install the Rasa dependencies for spaCy first.

python -m pip install rasa[spacy]

Documentation

You can find the documentation for this project here.

Compatibility

This project currently supports components for Rasa 3.0. For older Rasa versions, refer to earlier releases of this project.

Tokenizers

Tokenizers split up the input text into tokens. Depending on the tokenizer that you pick, you can also choose to apply lemmatization. For languages with rich grammatical features, this might help reduce the number of possible tokens.

  • rasa_nlu_examples.tokenizers.BlankSpacyTokenizer docs
  • rasa_nlu_examples.tokenizers.ThaiTokenizer docs

Featurizers

Featurizers attach numeric features, dense or sparse, to each token as well as to the entire utterance. These features are picked up by intent classifiers and entity detectors later in the pipeline.

  • rasa_nlu_examples.featurizers.dense.FastTextFeaturizer docs
  • rasa_nlu_examples.featurizers.dense.BytePairFeaturizer docs
  • rasa_nlu_examples.featurizers.dense.GensimFeaturizer docs
  • rasa_nlu_examples.featurizers.sparse.TfIdfFeaturizer docs
  • rasa_nlu_examples.featurizers.sparse.HashingFeaturizer docs

Intent Classifiers

Intent classifiers are models that predict an intent from a given user message. The default intent classifier in Rasa NLU is the DIET model, which can be fairly computationally expensive, especially if you do not need to detect entities. We provide some examples of alternative intent classifiers here; a minimal example config is sketched after the list below.

  • rasa_nlu_examples.classifiers.NaiveBayesClassifier docs
  • rasa_nlu_examples.classifiers.LogisticRegressionClassifier docs
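
As a rough sketch, assuming you only need intent classification, a pipeline that swaps DIET for the logistic regression classifier could look like the config below. It reuses only settings that appear elsewhere in this README, so treat it as a starting point rather than a recommended configuration.

language: en
pipeline:
- name: WhitespaceTokenizer
- name: CountVectorsFeaturizer
  analyzer: word
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: rasa_nlu_examples.classifiers.LogisticRegressionClassifier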

Entity Extractors

  • rasa_nlu_examples.extractor.FlashTextEntityExtractor docs
  • rasa_nlu_examples.extractor.DateparserEntityExtractor docs
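
As a hedged sketch, the FlashText extractor can sit at the end of an otherwise ordinary pipeline. It matches entities against the lookup tables in your NLU training data, much like Rasa's RegexEntityExtractor, but it is faster on long lists. The case_sensitive flag shown here is an assumption; check the component docs for the options your version actually supports.

pipeline:
- name: WhitespaceTokenizer
- name: CountVectorsFeaturizer
- name: DIETClassifier
  epochs: 100
- name: rasa_nlu_examples.extractor.FlashTextEntityExtractor
  case_sensitive: false  # assumed option, see the component docs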

Name Lists

Language models are typically trained on Western datasets. That means that the reported benchmarks might not apply to your use case. For example, detecting names in texts from France is not the same thing as detecting names in Madagascar. Even though French is used actively in both countries, the names of their citizens might be so different that you cannot assume the benchmarks apply universally.

To remedy this, we've started collecting name lists. These can be used as a lookup table, which can be picked up by Rasa's RegexEntityExtractor or our FlashTextEntityExtractor. It won't be 100% perfect, but it should give a reasonable starting point.

You can find the namelists here. We currently offer namelists for the United States and Germany, as well as common Arabic names. Feel free to submit PRs for more languages. We're also eager to receive feedback. The sketch below shows how such a list can be wired into a pipeline.
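
This is a rough sketch assuming a hypothetical person_name entity and made-up example names: the names from the list go into a lookup table in your NLU training data, and the extractor is configured to use lookup tables.

# in your NLU training data (e.g. nlu.yml)
nlu:
- lookup: person_name
  examples: |
    - James
    - Maria
    - Yusuf

# in config.yml
pipeline:
- name: WhitespaceTokenizer
- name: RegexEntityExtractor
  use_lookup_tables: true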

Usage

You can install the examples from this repo via pip. Let's say you download the standard tools.

pip install git+https://github.com/RasaHQ/rasa-nlu-examples

Once installed, you can add the tools to your config.yml file. Here's an example:

language: en
pipeline:
- name: WhitespaceTokenizer
- name: CountVectorsFeaturizer
  OOV_token: oov.txt
  analyzer: word
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: rasa_nlu_examples.featurizers.dense.BytePairFeaturizer
  lang: en
  vs: 1000
  dim: 25
- name: DIETClassifier
  epochs: 200

An example config for using the Thai tokenizer would look like:

language: th
pipeline:
  - name: rasa_nlu_examples.tokenizers.ThaiTokenizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 200

And you can use this file to run benchmarks. From the root folder of the project, that typically means running something like:

rasa test nlu --config basic-bytepair-config.yml \
          --cross-validation --runs 1 --folds 2 \
          --out gridresults/basic-bytepair-config

Open an Issue

If you've spotted a bug, you can submit an issue here. GitHub issues allow us to keep track of a conversation about this repository, and they are the preferred communication channel for bugs related to this project.

Contribute

There are many ways you can contribute to this project.

  • You can suggest new features.
  • You can let us know if there are bugs.
  • You can share the results of an experiment you ran using these tools.
  • You can let us know if the components in this library help you.

Feel free to start the discussion by opening an issue on this repository. Before submitting code to the repository, it would help if you first create an issue so that the maintainers can discuss the changes you would like to contribute. A more in-depth contribution guide can be found here.

To get started locally you can run:

python -m pip install -e ".[dev]"
pre-commit install
python tests/scripts/prepare_fasttext.py

Alternatively you may also run this via the Makefile:

make install

Documentation

If you want to build the documentation locally you can do so via:

mkdocs serve

If you want to deploy the docs to GitHub you can run:

mkdocs gh-deploy

This will deploy a new version to the docs branch, which is picked up by GitHub pages.


rasa-nlu-examples's Issues

FlashTextEntityExtractor

Regex can be slow. Instead, it might help to have an entity extractor that is based on flashtext. This can really help for long name-lists for example.

Support 2.0

Are you going to adapt the examples to the 2.0 version?

ButtonConfirmAction

It might be nice to open source an action that, given a threshold of uncertainty, asks the user with buttons which intent was actually queried.

Missing `prepare_everything.py`

The make install command ends with

python tests/prepare_everything.py
python: can't open file 'tests/prepare_everything.py': [Errno 2] No such file or directory
make: *** [Makefile:4: install] Error 2

Also, the documentation refers to this file, but it doesn't exist.

Tensorflow error while running benchmarking guide

Hi @koaning

I followed the benchmarking guide here:
https://rasahq.github.io/rasa-nlu-examples/benchmarking/

but ran into this error:


(binus) Wellys-MacBook-Pro:rasa-demo wellytambunan$ rasa test nlu --config basic-bytepair.config.yml           --cross-validation --runs 1 --folds 2           --out gridresults/basic-bytepair-config
2020-08-14 10:22:31 INFO     rasa.cli.test  - Test model using cross validation.
Traceback (most recent call last):
  File "/Users/wellytambunan/opt/anaconda3/envs/binus/bin/rasa", line 8, in <module>
    sys.exit(main())
  File "/Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/rasa/__main__.py", line 92, in main
    cmdline_arguments.func(cmdline_arguments)
  File "/Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/rasa/cli/test.py", line 147, in run_nlu_test
    perform_nlu_cross_validation(config, nlu_data, output, vars(args))
  File "/Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/rasa/test.py", line 243, in perform_nlu_cross_validation
    data, folds, nlu_config, output, **kwargs
  File "/Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/rasa/nlu/test.py", line 1354, in cross_validate
    trainer = Trainer(nlu_config)
  File "/Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/rasa/nlu/model.py", line 142, in __init__
    components.validate_requirements(cfg.component_names)
  File "/Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/rasa/nlu/components.py", line 46, in validate_requirements
    from rasa.nlu import registry
  File "/Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/rasa/nlu/registry.py", line 13, in <module>
    from rasa.nlu.classifiers.diet_classifier import DIETClassifier
  File "/Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/rasa/nlu/classifiers/diet_classifier.py", line 9, in <module>
    import tensorflow_addons as tfa
  File "/Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/tensorflow_addons/__init__.py", line 21, in <module>
    from tensorflow_addons import activations
  File "/Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/tensorflow_addons/activations/__init__.py", line 21, in <module>
    from tensorflow_addons.activations.gelu import gelu
  File "/Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/tensorflow_addons/activations/gelu.py", line 24, in <module>
    get_path_to_datafile("custom_ops/activations/_activation_ops.so"))
  File "/Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/tensorflow/python/framework/load_library.py", line 58, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: dlopen(/Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/tensorflow_addons/custom_ops/activations/_activation_ops.so, 6): Symbol not found: __ZN10tensorflow11GetNodeAttrERKNS_9AttrSliceEN4absl11string_viewEPb
  Referenced from: /Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/tensorflow_addons/custom_ops/activations/_activation_ops.so
  Expected in: /Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.2.dylib
 in /Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/tensorflow_addons/custom_ops/activations/_activation_ops.so
(binus) Wellys-MacBook-Pro:rasa-demo wellytambunan$ 

deploy rasa-nlu-examples with docker

Hi everyone,
I trained a new NLU-only model with my own w2v model. Now I want to put it into a production/staging environment, but the official Rasa Docker image does not support this project. Could you show me how to create a Dockerfile, based on the official Rasa image, that can use my own w2v model?

Standardise Ways to Share Results

We'd love it if people could let us know which tools help them. With that in mind, it would be good to consider a page in the documentation where we can list results. It'd probably be best to separate the results per language per project?

This thread is to collect ideas on this topic.

missing GensimFeaturizer module after pip install git+

I put the GensimFeaturizer in my Rasa NLU pipeline and got this message:

Exception: Failed to find class 'GensimFeaturizer' in module 'rasa_nlu_examples.featurizers.dense'.

After trying to uninstall and reinstall via pip from git, cloning the master branch at the latest commit, and so on, the problem was not solved.
Finally, I found that the class's .py file was not present, so I copied gensim_featurizer.py into the installed package folder (/rasa_nlu_example/featurizer/dense) and it worked!

Printer should use Rich

The printer should print pretty objects. The rich library might be able to make a difference. It might make the output a lot clearer.

Before an implementation happens a path forward should be discussed here. Are we going to use tables? Colors?

Speed Up Tests

Currently we're using the CLI runner to run smoke tests. This is a good idea, but it is incredibly slow. It has to shut down and load up TensorFlow at every pass, and this is slowing down the CI on GitHub.

It's probably much better to drop the CLI and instead run the command that it triggers directly from within Python.

Turkish NLU data

I'd like to add Turkish NLU data if there isn't anyone else doing it at the moment.

SparseSpacyFeaturizer

If you have a look at all the attributes that spaCy generates for its tokens, you can imagine that some of these features can be useful for machine learning pipelines. To name a few:

  • is_oov: is the token out of vocabulary, i.e. does it lack a word vector?
  • is_stop: is the token a stopword?
  • lemma_: what is the lemma of the token?
  • pos_/tag_: coarse- and fine-grained part-of-speech information
  • morphological features
  • grammatical dependency

These can all have a discrete representation and could be added in general to a Rasa pipeline.

Benchmarking Results

We want to add a portion to the documentation where people can share some results of the tools in this library. We'll gladly link to any blogpost/github project as well but we'd love to hear/list in which scenarios our tools make a difference.

Anybody with results is free to ping @koaning here.

BlankSpacyTokenizer

spaCy has blank tokenizers for lots of languages. They are rule-based, and therefore they should technically have features that our simple whitespace tokenizer doesn't.

CustomPythonComponent

Goal

To make hacking around easier for our research department, it might make sense to have a component that can just apply a function on a message.

The idea is that you, as a user, can define a file, say custom_component.py, like:

# custom_component.py
from rasa_nlu_examples.meta import CustomPythonComponent
model = load_fasttext()

def fasttext(message, setting_a=1, setting_b=2):
    """this is pseudocode"""
    model.process(message) 
    return message # this message now has extra features attached

MyFastTextTool = CustomPythonComponent(fasttext, setting_a=1, setting_b=2)

Once such a file is in your project, it'd be cool if you could do:

language: en

pipeline:
- name: WhitespaceTokenizer
- name: LexicalSyntacticFeaturizer
- name: rasa_nlu_examples.meta.Printer
  alias: before count vectors
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: rasa_nlu_examples.meta.Printer
  alias: after count vectors
- name: custom_component.MyFastTextTool
  setting_a: 1
  setting_b: 2
- name: DIETClassifier
  epochs: 100

This should make it much easier to add a featurizer. We shouldn't implement internal tools with it, but this might allow for some experimentation with actual Rasa tools as opposed to Jupyter notebooks.

Add Sentencepiece Tokeniser support

SentencePiece is generally used to create byte-pair encodings for any language, and as far as I can tell there is no built-in support for this kind of tokenisation in Rasa. This library uses BPEmb, but only for pretrained embeddings, not tokenisation. Since the whitespace tokeniser doesn't always perform well, I would like to have support for it. I am willing to open a PR for this, but I don't know about the contribution steps here.

Add LogisticRegression

I'd like to add another intent classification model here, similar to the naive Bayes model.

Feature Request: Gensim Key-Value Pairs

Gensim probably offers the easiest way to train your own embeddings, which might allow users to use their own if they have a reliable corpus. I've understood that Wikipedia is not a reliable source of online slang for many languages.

NLU stopwords

The Rasa CountVectorsFeaturizer currently has a stop-words attribute, but it unfortunately only works for the sparse features. Any word embeddings that belong to a stopword are still generated. It might make sense to build a component that actually removes the stopwords from the message before it is handled by other components.
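
For reference, a minimal sketch of the current situation the issue describes, assuming the standard stop_words option on the sparse featurizer: the stop words only affect the sparse features, while a dense featurizer such as the BytePairFeaturizer still produces vectors for them.

pipeline:
- name: WhitespaceTokenizer
- name: CountVectorsFeaturizer
  stop_words: english  # only affects the sparse features
- name: rasa_nlu_examples.featurizers.dense.BytePairFeaturizer
  lang: en
  vs: 1000
  dim: 25  # stopword tokens still receive dense vectors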

Dependency Management

It might be better to make it the users' responsibility to handle the dependencies and not have them installed immediately via pip. I don't think we can use the [all] syntax when we're installing from GitHub, but we could explore that.

In any case, we really don't want folks to download the Thai-language tools or PyTorch if they're only using the bytepair embeddings.

Adding Stanza

In places where FastText wrapped into spaCy is of no use, Stanza comes in handy: it can give us the necessary POS tags and lemmatization. This is, at least, the case for Estonian. It should also be the case for Finnish, Hebrew, Hindi, Hungarian, Indonesian, Irish, Korean, Latvian, Persian, Telugu, Urdu, etc.

Non-English Namelists

This is a response to this question and this one. It seems clear that a pre-trained French model can't be expected to detect names that aren't from France very well. The questions on the forum are about French, but I can certainly imagine this issue being relevant for other languages too.

So, as a next-best idea, we should host common names per country somewhere so folks can use them as a lookup table for regexes. This needs to be a community effort, but it can be very helpful.

Tokenizers for Less Common Languages

The whitespace tokenizer in Rasa is focused on Western languages. If there are languages that would benefit from a different tokenizer, we might explore alternatives in this thread.

Experiment with `padatious`

There was a package suggested here that might be worth exploring. It is suggested that it works really well on very small datasets. The project can be found here.

Investigate CLTK

I found this project on GitHub and it might offer interesting tools for, among other languages, Hindi. I do not know if the tokenizers are of high quality, but they are documented here.

zemberek

From the TensorFlow Turkey meetup. Let's investigate whether we can add it here.

Unclear error message when file is not found

Currently when the file is not found at the specified cache_dir in the config.yml file, the error message is very opaque:

ValueError: /path/to/file/wiki.es.bin cannot be opened for loading!

The problem was that the file wiki.es.bin does not exist at /path/to/file, though this is not at all obvious from the error message.

Doc2Vec instead of Word2Vec model for Gensim featurizer?

Wondering whether a doc2vec model, instead of word2vec, can be trained and used with the Gensim featurizer?

Also, if HFTransformer is chosen as the language model, and thus the corresponding tokenizer and featurizer are used in the NLU pipeline, can the Gensim featurizer still be added to the pipeline to improve domain-specific processing?

Thanks.

Feature Request: spaCy POS tags

A lot of entities will be nouns. Even if we don't use spaCy as an entity detection engine, it does come with linguistic features, such as POS tags, that might be useful for our pipeline. Would be worth an experiment.

Github Workflow does not work with BytePairFeaturizer anymore, because FastTextFeaturizer can't be found

Hey there, I have been using a CI/CD pipeline on GitHub for a while that installs rasa-nlu-examples and then trains and tests the model.
It worked without any problems.
Today the workflow fails and I get this error:

ComponentNotFoundException: Failed to load the component 'rasa_nlu_examples.featurizers.dense.BytePairFeaturizer'. Failed to find module 'rasa_nlu_examples.featurizers.dense'. Either your pipeline configuration contains an error or the module you are trying to import is broken (e.g. the module is trying to import a package that is not installed). Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.7.9/x64/lib/python3.7/site-packages/rasa/nlu/registry.py", line 121, in get_component_class
    return rasa.shared.utils.common.class_from_module_path(component_name)
  File "/opt/hostedtoolcache/Python/3.7.9/x64/lib/python3.7/site-packages/rasa/shared/utils/common.py", line 20, in class_from_module_path
    m = importlib.import_module(module_name)
  File "/opt/hostedtoolcache/Python/3.7.9/x64/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/opt/hostedtoolcache/Python/3.7.9/x64/lib/python3.7/site-packages/rasa_nlu_examples/featurizers/dense/__init__.py", line 1, in <module>
    from .fasttext_featurizer import FastTextFeaturizer
  File "/opt/hostedtoolcache/Python/3.7.9/x64/lib/python3.7/site-packages/rasa_nlu_examples/featurizers/dense/fasttext_featurizer.py", line 5, in <module>
    import fasttext
ModuleNotFoundError: No module named 'fasttext'

Error: Process completed with exit code 1.
