
This repository contains examples of custom components for educational purposes.

Home Page: https://RasaHQ.github.io/rasa-nlu-examples/

License: Apache License 2.0



Rasa NLU Examples

This repository contains Rasa-compatible machine learning components. These components are open-sourced in order to encourage experimentation and to quickly offer support for more tools. By hosting these components here, they do not need to go through the same vetting process as the components in Rasa, and we hope this makes it easier for people to contribute new ideas.

The components in this repository are not officially supported. There will be unit tests as well as documentation, but this project should be considered a community project, not something that is part of core Rasa. If a component here turns out to be useful to the larger Rasa community, we might port features from this repository into Rasa.

Install

To use these tools locally, you need to install them via git:

python -m pip install "rasa_nlu_examples @ git+https://github.com/RasaHQ/rasa-nlu-examples.git"

Note that if you want to install optional dependencies as well, you'll need to run one of:

python -m pip install "rasa_nlu_examples[flashtext] @ git+https://github.com/RasaHQ/rasa-nlu-examples.git"
python -m pip install "rasa_nlu_examples[dateparser] @ git+https://github.com/RasaHQ/rasa-nlu-examples.git"
python -m pip install "rasa_nlu_examples[thai] @ git+https://github.com/RasaHQ/rasa-nlu-examples.git"
python -m pip install "rasa_nlu_examples[fasttext] @ git+https://github.com/RasaHQ/rasa-nlu-examples.git"
python -m pip install "rasa_nlu_examples[all] @ git+https://github.com/RasaHQ/rasa-nlu-examples.git"

If you're using any models that depend on spaCy you'll need to install the Rasa dependencies for spaCy first.

python -m pip install rasa[spacy]

Documentation

You can find the documentation for this project here.

Compatibility

This project currently supports components for Rasa 3.0. For older Rasa versions, refer to earlier releases of this project.

Tokenizers

Tokenizers split up the input text into tokens. Depending on the tokenizer that you pick, you can also choose to apply lemmatization. For languages with rich grammatical features, this might help reduce the number of possible tokens.

  • rasa_nlu_examples.tokenizers.BlankSpacyTokenizer docs
  • rasa_nlu_examples.tokenizers.ThaiTokenizer docs

Featurizers

Featurizers attach numeric features, dense or sparse, to each token as well as to the entire utterance. These features are picked up by intent classifiers and entity detectors later in the pipeline.

  • rasa_nlu_examples.featurizers.dense.FastTextFeaturizer docs
  • rasa_nlu_examples.featurizers.dense.BytePairFeaturizer docs
  • rasa_nlu_examples.featurizers.dense.GensimFeaturizer docs
  • rasa_nlu_examples.featurizers.sparse.TfIdfFeaturizer docs
  • rasa_nlu_examples.featurizers.sparse.HashingFeaturizer docs

Intent Classifiers

Intent classifiers are models that predict an intent from a given user message. The default intent classifier in Rasa NLU is the DIET model, which can be fairly computationally expensive, especially if you do not need to detect entities. We provide some examples of alternative intent classifiers here; a minimal example config is sketched after the list below.

  • rasa_nlu_examples.classifiers.NaiveBayesClassifier docs
  • rasa_nlu_examples.classifiers.LogisticRegressionClassifier docs
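
As a rough sketch, assuming you only need intent classification, a pipeline that swaps DIET for the logistic regression classifier could look like the config below. It reuses only settings that appear elsewhere in this README, so treat it as a starting point rather than a recommended configuration.

language: en
pipeline:
- name: WhitespaceTokenizer
- name: CountVectorsFeaturizer
  analyzer: word
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: rasa_nlu_examples.classifiers.LogisticRegressionClassifier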

Entity Extractors

  • rasa_nlu_examples.extractor.FlashTextEntityExtractor docs
  • rasa_nlu_examples.extractor.DateparserEntityExtractor docs
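
As a hedged sketch, the FlashText extractor can sit at the end of an otherwise ordinary pipeline. It matches entities against the lookup tables in your NLU training data, much like Rasa's RegexEntityExtractor, but it is faster on long lists. The case_sensitive flag shown here is an assumption; check the component docs for the options your version actually supports.

pipeline:
- name: WhitespaceTokenizer
- name: CountVectorsFeaturizer
- name: DIETClassifier
  epochs: 100
- name: rasa_nlu_examples.extractor.FlashTextEntityExtractor
  case_sensitive: false  # assumed option, see the component docs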

Name Lists

Language models are typically trained on Western datasets. That means that the reported benchmarks might not apply to your use case. For example, detecting names in texts from France is not the same thing as detecting names in Madagascar. Even though French is used actively in both countries, the names of their citizens might be so different that you cannot assume the benchmarks apply universally.

To remedy this, we've started collecting name lists. These can be used as a lookup table, which can be picked up by Rasa's RegexEntityExtractor or our FlashTextEntityExtractor. It won't be 100% perfect, but it should give a reasonable starting point.

You can find the namelists here. We currently offer namelists for the United States and Germany, as well as common Arabic names. Feel free to submit PRs for more languages. We're also eager to receive feedback. The sketch below shows how such a list can be wired into a pipeline.
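
This is a rough sketch assuming a hypothetical person_name entity and made-up example names: the names from the list go into a lookup table in your NLU training data, and the extractor is configured to use lookup tables.

# in your NLU training data (e.g. nlu.yml)
nlu:
- lookup: person_name
  examples: |
    - James
    - Maria
    - Yusuf

# in config.yml
pipeline:
- name: WhitespaceTokenizer
- name: RegexEntityExtractor
  use_lookup_tables: true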

Usage

You can install the examples from this repo via pip. Let's say you download the standard tools.

pip install git+https://github.com/RasaHQ/rasa-nlu-examples

Once installed, you can add the tools to your config.yml file. Here's an example:

language: en
pipeline:
- name: WhitespaceTokenizer
- name: CountVectorsFeaturizer
  OOV_token: oov.txt
  analyzer: word
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: rasa_nlu_examples.featurizers.dense.BytePairFeaturizer
  lang: en
  vs: 1000
  dim: 25
- name: DIETClassifier
  epochs: 200

An example config for using the Thai tokenizer would look like:

language: th
pipeline:
  - name: rasa_nlu_examples.tokenizers.ThaiTokenizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 200

And you can use this file to run benchmarks. From the root folder of the project, that typically means running something like:

rasa test nlu --config basic-bytepair-config.yml \
          --cross-validation --runs 1 --folds 2 \
          --out gridresults/basic-bytepair-config

Open an Issue

If you've spotted a bug, you can submit an issue here. GitHub issues allow us to keep track of a conversation about this repository, and they are the preferred communication channel for bugs related to this project.

Contribute

There are many ways you can contribute to this project.

  • You can suggest new features.
  • You can let us know if there are bugs.
  • You can share the results of an experiment you ran using these tools.
  • You can let us know if the components in this library help you.

Feel free to start the discussion by opening an issue on this repository. Before submitting code to the repository, it would help if you first create an issue so that the maintainers can discuss the changes you would like to contribute. A more in-depth contribution guide can be found here.

To get started locally you can run:

python -m pip install -e ".[dev]"
pre-commit install
python tests/scripts/prepare_fasttext.py

Alternatively you may also run this via the Makefile:

make install

Documentation

If you want to build the documentation locally you can do so via:

mkdocs serve

If you want to deploy the docs to GitHub you can run:

mkdocs gh-deploy

This will deploy a new version to the docs branch, which is picked up by GitHub pages.


rasa-nlu-examples's Issues

FlashTextEntityExtractor

Regex can be slow. Instead, it might help to have an entity extractor that is based on flashtext. This can really help for long name-lists for example.

Support 2.0

Are you going to adapt the examples to the 2.0 version?

ButtonConfirmAction

It might be nice to open source an action that, given a threshold of uncertainty, asks the user with buttons which intent was actually queried.

Missing `prepare_everything.py`

The make install command ends with

python tests/prepare_everything.py
python: can't open file 'tests/prepare_everything.py': [Errno 2] No such file or directory
make: *** [Makefile:4: install] Error 2

Also, the documentation refers to this file, but it doesn't exist.

Tensorflow error while running benchmarking guide

Hi @koaning

I followed the benchmarking guide here:
https://rasahq.github.io/rasa-nlu-examples/benchmarking/

but ran into this error:


(binus) Wellys-MacBook-Pro:rasa-demo wellytambunan$ rasa test nlu --config basic-bytepair.config.yml           --cross-validation --runs 1 --folds 2           --out gridresults/basic-bytepair-config
2020-08-14 10:22:31 INFO     rasa.cli.test  - Test model using cross validation.
Traceback (most recent call last):
  File "/Users/wellytambunan/opt/anaconda3/envs/binus/bin/rasa", line 8, in <module>
    sys.exit(main())
  File "/Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/rasa/__main__.py", line 92, in main
    cmdline_arguments.func(cmdline_arguments)
  File "/Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/rasa/cli/test.py", line 147, in run_nlu_test
    perform_nlu_cross_validation(config, nlu_data, output, vars(args))
  File "/Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/rasa/test.py", line 243, in perform_nlu_cross_validation
    data, folds, nlu_config, output, **kwargs
  File "/Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/rasa/nlu/test.py", line 1354, in cross_validate
    trainer = Trainer(nlu_config)
  File "/Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/rasa/nlu/model.py", line 142, in __init__
    components.validate_requirements(cfg.component_names)
  File "/Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/rasa/nlu/components.py", line 46, in validate_requirements
    from rasa.nlu import registry
  File "/Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/rasa/nlu/registry.py", line 13, in <module>
    from rasa.nlu.classifiers.diet_classifier import DIETClassifier
  File "/Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/rasa/nlu/classifiers/diet_classifier.py", line 9, in <module>
    import tensorflow_addons as tfa
  File "/Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/tensorflow_addons/__init__.py", line 21, in <module>
    from tensorflow_addons import activations
  File "/Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/tensorflow_addons/activations/__init__.py", line 21, in <module>
    from tensorflow_addons.activations.gelu import gelu
  File "/Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/tensorflow_addons/activations/gelu.py", line 24, in <module>
    get_path_to_datafile("custom_ops/activations/_activation_ops.so"))
  File "/Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/tensorflow/python/framework/load_library.py", line 58, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: dlopen(/Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/tensorflow_addons/custom_ops/activations/_activation_ops.so, 6): Symbol not found: __ZN10tensorflow11GetNodeAttrERKNS_9AttrSliceEN4absl11string_viewEPb
  Referenced from: /Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/tensorflow_addons/custom_ops/activations/_activation_ops.so
  Expected in: /Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.2.dylib
 in /Users/wellytambunan/opt/anaconda3/envs/binus/lib/python3.6/site-packages/tensorflow_addons/custom_ops/activations/_activation_ops.so
(binus) Wellys-MacBook-Pro:rasa-demo wellytambunan$ 

deploy rasa-nlu-examples with docker

Hi everyone,
I trained a new NLU-only model with my own w2v model. Now I want to put it into a production/staging environment, but the official Rasa Docker image does not support this project. Could you show me how to create a Dockerfile, based on the official Rasa image, that can use my own w2v model?

Standardise Ways to Share Results

We'd love it if people could let us know which tools help them. With that in mind, it would be good to consider a page in the documentation where we can list results. It'd probably be best to separate the results per language per project?

This thread is to collect ideas on this topic.

missing GensimFeaturizer module after pip install git+

I put the GensimFeaturizer in my Rasa NLU pipeline and got this message:

Exception: Failed to find class 'GensimFeaturizer' in module 'rasa_nlu_examples.featurizers.dense'.

After trying to uninstall and reinstall via pip from git, cloning the master branch at the latest commit, and so on, the problem was not solved.
Finally, I found that the class's .py file was not present, so I copied gensim_featurizer.py into the installed package folder (/rasa_nlu_example/featurizer/dense) and it worked!

Printer should use Rich

The printer should print pretty objects. The rich library might be able to make a difference. It might make the output a lot clearer.

Before an implementation happens a path forward should be discussed here. Are we going to use tables? Colors?

Speed Up Tests

Currently we're using the CLI runner to run smoke tests. This is a good idea, but it is incredibly slow. It has to shut down and load up TensorFlow at every pass, and this is slowing down the CI on GitHub.

It's probably much better to drop the CLI and instead run the command that it triggers directly from within Python.

Turkish NLU data

I'd like to add Turkish NLU data if there isn't anyone else doing it at the moment.

SparseSpacyFeaturizer

If you have a look at all the attributes that spaCy generates for its tokens, you can imagine that some of these features can be useful for machine learning pipelines. To name a few:

  • is_oov: is the token out of vocabulary, i.e. does it lack a word vector?
  • is_stop: is the token a stopword?
  • lemma_: what is the lemma of the token?
  • pos_/tag_: coarse- and fine-grained part-of-speech information
  • morphological features
  • grammatical dependency

These can all have a discrete representation and could be added in general to a Rasa pipeline.

Benchmarking Results

We want to add a portion to the documentation where people can share some results of the tools in this library. We'll gladly link to any blogpost/github project as well but we'd love to hear/list in which scenarios our tools make a difference.

Anybody with results is free to ping @koaning here.

BlankSpacyTokenizer

spaCy has blank tokenizers for lots of languages. They are rule-based, and therefore they should technically have features that our simple whitespace tokenizer doesn't.

CustomPythonComponent

Goal

To make hacking around easier for our research department, it might make sense to have a component that can just apply a function on a message.

The idea is that you, as a user, can define a file, say custom_component.py, like:

# custom_component.py
from rasa_nlu_examples.meta import CustomPythonComponent
model = load_fasttext()

def fasttext(message, setting_a=1, setting_b=2):
    """this is pseudocode"""
    model.process(message) 
    return message # this message now has extra features attached

MyFastTextTool = CustomPythonComponent(fasttext, setting_a=1, setting_b=2)

Once such a file is in your project, it'd be cool if you could do:

language: en

pipeline:
- name: WhitespaceTokenizer
- name: LexicalSyntacticFeaturizer
- name: rasa_nlu_examples.meta.Printer
  alias: before count vectors
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: rasa_nlu_examples.meta.Printer
  alias: after count vectors
- name: custom_component.MyFastTextTool
  setting_a: 1
  setting_b: 2
- name: DIETClassifier
  epochs: 100

This should make it much easier to add a featurizer. We shouldn't implement internal tools with it, but this might allow for some experimentation with actual Rasa tools as opposed to Jupyter notebooks.

Add Sentencepiece Tokeniser support

SentencePiece is generally used to create byte-pair encodings for any language, and as far as I can tell there is no built-in support for this kind of tokenisation in Rasa. This library uses BPEmb, but only for pretrained embeddings, not tokenisation. Since the whitespace tokeniser doesn't always perform well, I would like to have support for it. I am willing to open a PR for this, but I don't know about the contribution steps here.

Add LogisticRegression

I'd like to add another intent classification model here, similar to the naive Bayes model.

Feature Request: Gensim Key-Value Pairs

Gensim probably offers the easiest way to train your own embeddings, which might allow users to use their own if they have a reliable corpus. I've understood that Wikipedia is not a reliable source of online slang for many languages.

NLU stopwords

The Rasa CountVectorsFeaturizer currently has a stop-words attribute, but it unfortunately only works for the sparse features. Any word embeddings that belong to a stopword are still generated. It might make sense to build a component that actually removes the stopwords from the message before it is handled by other components.
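
For reference, a minimal sketch of the current situation the issue describes, assuming the standard stop_words option on the sparse featurizer: the stop words only affect the sparse features, while a dense featurizer such as the BytePairFeaturizer still produces vectors for them.

pipeline:
- name: WhitespaceTokenizer
- name: CountVectorsFeaturizer
  stop_words: english  # only affects the sparse features
- name: rasa_nlu_examples.featurizers.dense.BytePairFeaturizer
  lang: en
  vs: 1000
  dim: 25  # stopword tokens still receive dense vectors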

Dependency Management

It might be better to make it the users' responsibility to handle the dependencies and not have them installed immediately via pip. I don't think we can use the [all] syntax when we're installing from GitHub, but we could explore that.

In any case, we really don't want folks to download the Thai-language tools or PyTorch if they're only using the bytepair embeddings.

Adding Stanza

In places where FastText wrapped into spaCy is of no use, Stanza comes in handy: it can give us the necessary POS tags and lemmatization. This is, at least, the case for Estonian. It should also be the case for Finnish, Hebrew, Hindi, Hungarian, Indonesian, Irish, Korean, Latvian, Persian, Telugu, Urdu, etc.

Non-English Namelists

This is a response to this question and this one. It seems clear that a pre-trained French model can't be expected to detect names that aren't from France very well. The questions on the forum are about French, but I can certainly imagine this issue being relevant for other languages too.

So, as a next-best idea, we should host common names per country somewhere so folks can use them as a lookup table for regexes. This needs to be a community effort, but it can be very helpful.

Tokenizers for Less Common Languages

The whitespace tokenizer in Rasa is focused on Western languages. If there are languages that would benefit from a different tokenizer, we might explore alternatives in this thread.

Experiment with `padatious`

There was a package suggested here that might be worth exploring. It is suggested that it works really well on very small datasets. The project can be found here.

Investigate CLTK

I found this project on GitHub and it might offer interesting tools for, among other languages, Hindi. I do not know if the tokenizers are of high quality, but they are documented here.

zemberek

From the TensorFlow Turkey meetup. Let's investigate whether we can add it here.

Unclear error message when file is not found

Currently when the file is not found at the specified cache_dir in the config.yml file, the error message is very opaque:

ValueError: /path/to/file/wiki.es.bin cannot be opened for loading!

The problem was that the file wiki.es.bin does not exist at /path/to/file, though this is not at all obvious from the error message.

Doc2Vec instead of Word2Vec model for Gensim featurizer?

Wondering whether a doc2vec model, instead of word2vec, can be trained and used with the Gensim featurizer?

Also, if HFTransformer is chosen as the language model, and thus the corresponding tokenizer and featurizer are used in the NLU pipeline, can the Gensim featurizer still be added to the pipeline to improve domain-specific processing?

Thanks.

Feature Request: spaCy POS tags

A lot of entities will be nouns. Even if we don't use spaCy as an entity detection engine, it does come with linguistic features, such as POS tags, that might be useful for our pipeline. Would be worth an experiment.

Github Workflow does not work with BytePairFeaturizer anymore, because FastTextFeaturizer can't be found

Hey there, I have been using a CI/CD pipeline on GitHub for a while that installs rasa-nlu-examples and then trains and tests the model.
It worked without any problems.
Today the workflow fails and I get this error:

ComponentNotFoundException: Failed to load the component 'rasa_nlu_examples.featurizers.dense.BytePairFeaturizer'. Failed to find module 'rasa_nlu_examples.featurizers.dense'. Either your pipeline configuration contains an error or the module you are trying to import is broken (e.g. the module is trying to import a package that is not installed). Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.7.9/x64/lib/python3.7/site-packages/rasa/nlu/registry.py", line 121, in get_component_class
    return rasa.shared.utils.common.class_from_module_path(component_name)
  File "/opt/hostedtoolcache/Python/3.7.9/x64/lib/python3.7/site-packages/rasa/shared/utils/common.py", line 20, in class_from_module_path
    m = importlib.import_module(module_name)
  File "/opt/hostedtoolcache/Python/3.7.9/x64/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/opt/hostedtoolcache/Python/3.7.9/x64/lib/python3.7/site-packages/rasa_nlu_examples/featurizers/dense/__init__.py", line 1, in <module>
    from .fasttext_featurizer import FastTextFeaturizer
  File "/opt/hostedtoolcache/Python/3.7.9/x64/lib/python3.7/site-packages/rasa_nlu_examples/featurizers/dense/fasttext_featurizer.py", line 5, in <module>
    import fasttext
ModuleNotFoundError: No module named 'fasttext'

Error: Process completed with exit code 1.
