dataiku / dss-plugin-nlp-preparation

Dataiku DSS plugin to detect languages, correct misspellings, and clean text data 🧼

Home Page: https://www.dataiku.com/product/plugins/nlp-preparation/

License: Apache License 2.0

Languages: Python 97.54%, Makefile 2.39%, Dockerfile 0.07%
Topics: dataiku, dss-plugin, nlp, natural-language-processing, auto-correct, spell-checker, language-detection, language-identification, text-cleaning

dss-plugin-nlp-preparation's People

Contributors

alexcombessie, alexlandeau, damienjacquemart, mhham, muennighoff, stanislasguinel, tdesfont

dss-plugin-nlp-preparation's Issues

Supported Languages in Text Cleaning are not accepted

Describe the bug
After applying the Language Detection step, I tried to clean the text and get token information such as numbers, symbols, and counts. However, I cannot apply the Text Cleaning step: it keeps failing with an error saying it found unsupported languages, even though those languages are listed as supported in the documentation.

To Reproduce
Steps to reproduce the behavior:

  1. Apply Language Detection step to get ISO 639-1 language code
  2. Apply the Text Cleaning step to supported languages (e.g. English, German)
  3. See error

Expected behavior
The text cleaning should be applied without errors.

Screenshots
Screenshots of the example data, the plugin configuration, and the error log were attached to the original issue (data_example, text_cleaning_configuration, error_log).

Additional context

  • DSS version 10.0.4
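
One way to narrow this down is to check which values in the language column are not exact matches for a supported code. The following is a minimal sketch, assuming the detected codes are available as a list of values. `SUPPORTED_CODES`, `unsupported_codes`, and the sample values are hypothetical and only for illustration; they are not the plugin's actual list or API.

```python
from collections import Counter

# Hypothetical subset of ISO 639-1 codes, for illustration only;
# the plugin's real supported list is in its documentation.
SUPPORTED_CODES = {"en", "de", "fr", "es"}


def unsupported_codes(codes, supported=SUPPORTED_CODES):
    """Count values that are not exact matches for a supported code."""
    return Counter(code for code in codes if code not in supported)


# Made-up values as they might appear in a language column.
detected = ["en", "de", "EN", "de ", "", None, "en"]
print(unsupported_codes(detected))
```

Empty cells, stray whitespace, or different casing would all be counted here even when the language itself is supported, which may help explain an "unsupported languages" error on data that looks valid.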

TypeError: Cannot cast array data from dtype('O') to dtype('bool') according to the rule 'safe'

Hi,

I was running the recipe/plugin in Dataiku and encountered the above error.
Below is an extract of the traceback.


*************** Recipe code failed **************
[09:50:49] [INFO] [dku.utils] - Begin Python stack
[09:50:49] [INFO] [dku.utils] - Traceback (most recent call last):
[09:50:49] [INFO] [dku.utils] - File "pandas/_libs/parsers.pyx", line 1156, in pandas._libs.parsers.TextReader._convert_tokens
[09:50:49] [INFO] [dku.utils] - TypeError: Cannot cast array data from dtype('O') to dtype('bool') according to the rule 'safe'
[09:50:49] [INFO] [dku.utils] - During handling of the above exception, another exception occurred:
[09:50:49] [INFO] [dku.utils] - Traceback (most recent call last):
[09:50:49] [INFO] [dku.utils] - File "/home/dataiku/dss/jobs/compute_Responses_Lemmatize_NP/custom-python-recipe/pyoutAqiJYccVwKv8/python-exec-wrapper.py", line 208, in
[09:50:49] [INFO] [dku.utils] - exec(f.read())
[09:50:49] [INFO] [dku.utils] - File "", line 27, in
[09:50:49] [INFO] [dku.utils] - File "/home/dataiku/dss/plugins/installed/nlp-preparation/python-lib/dku_io_utils.py", line 79, in process_dataset_chunks
[09:50:49] [INFO] [dku.utils] - for i, df in tqdm(enumerate(df_iterator), total=len_iterator, unit="chunk", mininterval=1.0):
[09:50:49] [INFO] [dku.utils] - File "/home/dataiku/dss/code-envs/python/plugin_nlp-preparation_managed/lib/python3.6/site-packages/tqdm/std.py", line 1178, in __iter__
[09:50:49] [INFO] [dku.utils] - for obj in iterable:
[09:50:49] [INFO] [dku.utils] - File "/home/dataiku/dataiku-dss-9.0.4/python/dataiku/core/dataset.py", line 611, in iter_dataframes
[09:50:49] [INFO] [dku.utils] - for df in df_it:
[09:50:49] [INFO] [dku.utils] - File "/home/dataiku/dss/code-envs/python/plugin_nlp-preparation_managed/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1007, in __next__
[09:50:49] [INFO] [dku.utils] - return self.get_chunk()
[09:50:49] [INFO] [dku.utils] - File "/home/dataiku/dss/code-envs/python/plugin_nlp-preparation_managed/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1070, in get_chunk
[09:50:49] [INFO] [dku.utils] - return self.read(nrows=size)
[09:50:49] [INFO] [dku.utils] - File "/home/dataiku/dss/code-envs/python/plugin_nlp-preparation_managed/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1036, in read
[09:50:49] [INFO] [dku.utils] - ret = self._engine.read(nrows)
[09:50:49] [INFO] [dku.utils] - File "/home/dataiku/dss/code-envs/python/plugin_nlp-preparation_managed/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1848, in read
[09:50:49] [INFO] [dku.utils] - data = self._reader.read(nrows)
[09:50:49] [INFO] [dku.utils] - File "pandas/_libs/parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
[09:50:49] [INFO] [dku.utils] - File "pandas/_libs/parsers.pyx", line 903, in pandas._libs.parsers.TextReader._read_low_memory
[09:50:49] [INFO] [dku.utils] - File "pandas/_libs/parsers.pyx", line 968, in pandas._libs.parsers.TextReader._read_rows
[09:50:49] [INFO] [dku.utils] - File "pandas/_libs/parsers.pyx", line 1094, in pandas._libs.parsers.TextReader._convert_column_data
[09:50:49] [INFO] [dku.utils] - File "pandas/_libs/parsers.pyx", line 1164, in pandas._libs.parsers.TextReader._convert_tokens
[09:50:49] [INFO] [dku.utils] - ValueError: cannot safely convert passed user dtype of bool for object dtyped data in column 32


This happened for both spell checking and text cleaning.
I hope you can shed some light on this.

Thank you
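
The final ValueError in the traceback says that column 32 was declared as bool but was parsed as object, which usually means some cells in that column hold something other than a boolean. The following is a minimal sketch for locating such cells in a CSV export of the input dataset, using only the standard library. The accepted-token set, the sample data, and the function name `non_boolean_cells` are assumptions for illustration, not the exact rules used by pandas or the plugin. The column index in the message appears to be zero-based.

```python
import csv
import io

# Assumed set of tokens treated as booleans; adjust to match your data.
BOOLEAN_TOKENS = {"true", "false"}


def non_boolean_cells(csv_text, column_index):
    """Return (row_number, value) pairs for cells that are not boolean tokens."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    offenders = []
    for row_number, row in enumerate(reader, start=1):
        value = row[column_index]
        if value.strip().lower() not in BOOLEAN_TOKENS:
            offenders.append((row_number, value))
    return offenders


# Made-up sample in which column 1 is meant to be boolean.
sample = "id,flag\n1,true\n2,\n3,False\n4,maybe\n"
print(non_boolean_cells(sample, 1))
```

On real data, the same check would be run with `column_index=32` against the dataset the recipe reads. If it reports empty or free-text cells, changing that column's storage type to string in the dataset schema is one thing to try.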
