neomatrix369 / nlp_profiler Goto Github PK

A simple NLP library allows profiling datasets with one or more text columns. When given a dataset and a column name containing text data, NLP Profiler will return either high-level insights or low-level/granular statistical information about the text in that column.

License: Other

Python 90.19% Shell 9.81%

nlp nlp-library nlp-parsing nlp-keywords-extraction nlp-machine-learning text-mining natural-language-processing nlp-profiler kaggle-kernels google-colab

nlp_profiler's Introduction

NLP Profiler

||| |||

A simple NLP library that allows profiling datasets with one or more text columns.

NLP Profiler returns either high-level insights or low-level/granular statistical information about the text when given a dataset and a column name containing text data, in that column.

In short: Think of it as using the pandas.describe() function or running Pandas Profiling on your data frame, but for datasets containing text columns rather than the usual columnar datasets.

Community/Chat/Communication:
What do you get from the library?
Requirements
Getting started
Notebooks
Screenshots
Credits and supporters
Changes
License
Contributing

What do you get from the library?

Input a Pandas dataframe series as an input parameter.
You get back a new dataframe with various features about the parsed text per row.
- High-level: sentiment analysis, objectivity/subjectivity analysis, spelling quality check, grammar quality check, ease of readability check, etc...
- Low-level/granular: number of characters in the sentence, number of words, number of emojis, number of words, etc...
From the above numerical data in the resulting dataframe descriptive statistics can be drawn using the pandas.describe() on the dataframe.

See screenshots under the Jupyter section and also under Screenshots for further illustrations.

Under the hood it does make use of a number of libraries that are popular in the AI and ML communities, but we can extend it's functionality by replacing or adding other libraries as well.

A simple notebook have been provided to illustrate the usage of the library.

Please join the Gitter.im community and say "hello" to us, share your feedback, have a fun time with us.

Note: this is a new endeavour and it may have rough edges i.e. NLP_Profiler in its current version is probably NOT capable of doing many things. Many of these gaps are opportunities we can work on and plug, as we go along using it. Please provide constructive feedback to help with the improvement of this library. We just recently achieved this with scaling with larger datasets.

Requirements

Python 3.7.x or higher.
Dependencies described in the requirements.txt.
High-level including Grammar checks:
- faster processor
- higher RAM capacity
- working disk-space of 1 to 3 GBytes (depending on the dataset size)
(Optional)
- Jupyter Lab (on your local machine).
- Google Colab account.
- Kaggle account.
- Grammar check functionality:
  - Internet access
  - Java 8 or higher

Getting started

Installation

For Conda/Miniconda environments:

conda config --set pip_interop_enabled True
pip install "spacy >= 2.3.0,<3.0.0"         # in case spacy is not present
python -m spacy download en_core_web_sm

### now perform any of the below pathways/options

For Kaggle environments:

pip uninstall typing      # this can cause issues on Kaggle hence removing it helps

Follow any of the remaining installation steps but "avoid" using -U with pip install -- again this can cause issues on Kaggle hence not using it helps.

From PyPi:

pip install -U nlp_profiler

From the GitHub repo:

pip install -U git+https://github.com/neomatrix369/nlp_profiler.git@master

From the source:

For library development purposes, see Developer guide

Usage

import nlp_profiler.core as nlpprof

new_text_column_dataset = nlpprof.apply_text_profiling(dataset, 'text_column')

from nlp_profiler.core import apply_text_profiling

new_text_column_dataset = apply_text_profiling(dataset, 'text_column')

See Notebooks section for further illustrations.

Developer guide

See Developer guide to know how to build, test, and contribute to the library.

Demo and presentations

Look at a short demo of the NLP Profiler library at one of these:

Demo of the NLP Profiler library (Abhishek talks #6)

or you find the rest of the talk here or here for slides

Demo of the NLP Profiler library (NLP Zurich talk)

or you find the rest of the talk here or here for slides

Notebooks

After successful installation of the library, RESTART Jupyter kernels or Google Colab runtimes for the changes to take effect.

See Notebooks for usage and further details.

Screenshots

See Screenshots

Credits and supporters

See CREDITS_AND_SUPPORTERS.md

Changes

See CHANGELOG.md

License

Refer licensing (and warranty) policy.

Contributing

Contributions are Welcome!

Please have a look at the CONTRIBUTING guidelines.

Please share it with the wider community (and get credited for it)!

Go to the NLP page

nlp_profiler's People

Contributors

Stargazers

Watchers

nlp_profiler's Issues

[FEATURE] Automate library release process to GitHub and PyPi

Missing functionality

Currently, the release process (to GitHub and PyPi) is done manually, it's prone to errors, and the two scripts used work best in happy-path use-case scenarios while edge-case even though less to worry about are not taken care of, as well as they could have been.

The release to PyPi should be fail-safe as there is no way to revert if a mistake is made.

Proposed feature

Automate the process and checks and balances:

check if the git state is valid, report invalid state and abort step
Check if version information is tagged/entered into the CHANGELOG.md
- if the __version__ is the same, let the user know it needs to be entered before proceeding
- otherwise, mention the presence and proceed
version checking: compare local version stamp with that on git repo (releases/tags) and warn accordingly
- if the __version__ is the same, let the user know it needs to be incremented before proceeding
- otherwise mention local version and remote version and proceed with the process
version checking: compare local version stamp with that on pypi and warn accordingly
- if the __version__ is the same, let the user know it needs to be incremented before proceeding
- otherwise mention local version and remote version and proceed with the process
synchronise local and remote repo (part of the scripts)
- [x] by running GITHUB_TOKEN=$MY_GITHUB_TOKEN ./release-to-github.sh
ability to delete release and tag from local and remote with a switch in the script
- [ ] add to ./release-to-github.sh
ask the user when running the pypi release script if they REALLY wish to proceed
- [ ] when running ./release-to-pypi.sh
synchronise local and remote repo after releasing (part of the scripts) (already done when it's run the first time)
- [x] by running GITHUB_TOKEN=$MY_GITHUB_TOKEN ./release-to-github.sh

Provide tangible steps or CLI commands when suggesting solutions for the above steps. Also, add messages to suggest next steps for the two scripts when it finishes executing.

Alternatives considered

Manual intervention: perform all the above steps manually using the above checklist as the release process is a highly infrequent process.

Also, libraries like bumpversion could be taken into consideration, depending on how useful and flexible they are.

Improve code quality based on report from @sourcery-ai (generate_features.py)

Describe the limitations

As per the report from @sourcery-ai in #27 (comment) and #34 (comment)

Do the following:

reduce method length
make method(s), modules less complex (i.e. nlp_profiler/granular_features.py and other modules)
improve run-time working memory (maybe use memory-profiling)

Recommendation(s):

Extract out complex expressions
reduce working memory

Add phrase counts or parts-of-speech token counts after extracting entities from a sentence

On the back of the PR #13, it appears there are other types of phrase i.e. pronouns, or dates or organisations etc... - the details can be discussed. So far we have achieved these and there are a number of others to cover:

Name entity recognition features:

Parts of speech features:

See https://spacy.io/api/annotation#section-named-entities and http://www.nltk.org/book/ for details on the above items.

We will replace one or more existing functionalities in the libraries with the above, case-by-case basis. It would be best to group each of them and give them unique names like name-entity-recognition-features and parts-of-speech-features, respectively and club them with granular features.

Both NLTK and Spacey would be used to fulfill these functionalities.

[BUG] Documentation improvement: Correcting spelling mistakes.

This is a great library with very intuitive features.
However, while going through the readme files I found a couple of spelling mistakes and text that can be reframed. If you don't mind I would like to pass it through Grammarly and correct it. Let me know your thoughts. Cheers!

Sentences (in general) are getting an incorrect sentence_count value

During the presentation, it was observed that sentences with emojis could end up getting an incorrect sentence count. And this could be due to the punctuation(s) that build up emojis.

Note: there may be other edge-cases involving the (.) sign which is the primary indicator of the end of a sentence in English (and many such Latin and Germanic languages).

~~Quick solution~~
~~Look for emojis in the text, drop them. Then performance sentence counts on the text~~
~~Cache the respective functions so results can be reused.~~
Better/Robust solution
Use a library or existing algorithm that handle edge-cases better. Maybe nltk or spacey could help in this case.
Cache the respective functions so results can be reused.
Verify/validate
add edge case tests for sentence count
Apply fix to the dependent functionality
Spell check. used it and now it does not help with new less accurate scores, see issue #8

Related

See #13 and #15, the library/method used may help here

After checking multiple examples it seems many sentences were getting incorrect sentence count.

Cant install from pip (from conda environment)

pip install nlp_profiler shows this

$ pip install nlp_profiler
Collecting nlp_profiler
  Using cached nlp_profiler-0.0.2-py2.py3-none-any.whl (39 kB)
Requirement already satisfied: nltk>=3.5 in /home/tyoc213/miniconda3/envs/fastai/lib/python3.8/site-packages (from nlp_profiler) (3.5)
Requirement already satisfied: tqdm>=4.46.0 in /home/tyoc213/miniconda3/envs/fastai/lib/python3.8/site-packages (from nlp_profiler) (4.48.2)
Requirement already satisfied: requests>=2.23.0 in /home/tyoc213/miniconda3/envs/fastai/lib/python3.8/site-packages (from nlp_profiler) (2.24.0)
Requirement already satisfied: ipython>=7.12.0 in /home/tyoc213/miniconda3/envs/fastai/lib/python3.8/site-packages (from nlp_profiler) (7.18.1)
Collecting language-tool-python>=2.3.1
  Using cached language_tool_python-2.4.7-py3-none-any.whl (30 kB)
Requirement already satisfied: pandas in /home/tyoc213/miniconda3/envs/fastai/lib/python3.8/site-packages (from nlp_profiler) (1.1.1)
Collecting swifter>=1.0.3
  Using cached swifter-1.0.7.tar.gz (633 kB)
Collecting textblob>=0.15.3
  Using cached textblob-0.15.3-py2.py3-none-any.whl (636 kB)
ERROR: Could not find a version that satisfies the requirement en-core-web-sm (from nlp_profiler) (from versions: none)
ERROR: No matching distribution found for en-core-web-sm (from nlp_profiler)

To Reproduce
There is no dataframe to share because it cant be installed.

Version information:

Version information is essential in reproducing and resolving bugs. Please report:

Python 3.8.5
conda environment, Ubuntu 5.4.0-56-generic

pip freeze
absl-py==0.11.0
adal==1.2.4
alabaster==0.7.12
appdirs==1.4.4
argon2-cffi==20.1.0
astroid @ file:///tmp/build/80754af9/astroid_1592495912941/work
asttokens==2.0.4
attrs==20.1.0
audioread==2.1.8
azure-cognitiveservices-search-imagesearch==2.0.0
azure-common==1.1.25
Babel==2.8.0
backcall==0.2.0
birdseye==0.8.4
black==20.8b1
bleach==3.1.5
blis==0.4.1
cached-property==1.5.2
catalogue==1.0.0
certifi==2020.6.20
cffi==1.14.2
cfgv==3.2.0
chardet==3.0.4
cheap-repr==0.4.4
click==7.1.2
click-plugins==1.1.1
cligj==0.7.1
colorednoise==1.1.1
commonmark==0.9.1
configparser==5.0.1
coverage==5.2.1
cryptography==3.1
cycler==0.10.0
cymem==2.0.3
dataclasses==0.6
decorator==4.4.2
defusedxml==0.6.0
dill==0.3.3
distlib==0.3.1
docker-pycreds==0.4.0
docutils==0.16
einops==0.3.0
entrypoints==0.3
executing==0.5.3
-e [email protected]:tyoc213-contrib/fastai.git@e1f9d919b41775ddc99eccbf9ac0071c8089762c#egg=fastai
-e [email protected]:tyoc213/fastai_xla_extensions.git@c96e02ee5a31a8012be52c6eac35d6d327e067d1#egg=fastai_xla_extensions
fastbook==0.0.11
-e [email protected]:tyoc213-contrib/fastcore.git@17eb00509e24f4bd91e424b77a0cb7cff7557d1b#egg=fastcore
fastprogress==1.0.0
fastscript==1.0.0
filelock==3.0.12
Fiona==1.8.18
Flask==1.1.2
Flask-Humanize==0.3.0
future==0.18.2
geopandas==0.8.1
gitdb==4.0.5
GitPython==3.1.11
graphviz==0.14.1
heartrate==0.2.1
humanize==3.1.0
identify==1.4.30
idna==2.10
imagesize==1.2.0
iniconfig==1.0.1
ipykernel==5.3.4
ipython==7.18.1
ipython-genutils==0.2.0
ipywidgets==7.5.1
isodate==0.6.0
isort @ file:///tmp/build/80754af9/isort_1598376147378/work
itsdangerous==1.1.0
jedi==0.17.2
Jinja2==2.11.2
joblib==0.16.0
jsonschema==3.2.0
jupyter==1.0.0
jupyter-client==6.1.7
jupyter-console==6.2.0
jupyter-core==4.6.3
jupyter-notebook-gist==0.5.0
kaggle==1.5.9
kiwisolver==1.2.0
lazy-object-proxy==1.4.3
librosa==0.8.0
littleutils==0.2.2
livereload==2.6.3
llvmlite==0.34.0
lunr==0.5.8
Markdown==3.2.2
MarkupSafe==1.1.1
matplotlib==3.3.1
mccabe==0.6.1
memory-profiler==0.58.0
mir-eval==0.6
mistune==0.8.4
mkautodoc==0.1.0
mkdocs==1.1.2
mkdocs-material==5.5.12
mkdocs-material-extensions==1.0
mkl-fft==1.1.0
mkl-random==1.1.1
mkl-service==2.3.0
mknotebooks==0.4.1
more-itertools==8.5.0
msrest==0.6.19
msrestazure==0.6.4
munch==2.5.0
murmurhash==1.0.2
mypy-extensions==0.4.3
nbconvert==5.6.1
-e [email protected]:tyoc213-contrib/nbdev.git@c3281139de18bf17f161a6e621ef7cbcb443cdf1#egg=nbdev
nbformat==5.0.7
nlp==0.4.0
nltk==3.5
nodeenv==1.5.0
notebook==6.1.3
numba==0.51.2
numexpr==2.7.1
numpy @ file:///tmp/build/80754af9/numpy_and_numpy_base_1596233721170/work
oauthlib==3.1.0
ohmeow-blurr==0.0.18
olefile==0.46
outdated==0.2.0
packaging==20.4
pandas==1.1.1
pandocfilters==1.4.2
parso==0.7.1
pathspec==0.8.0
pexpect==4.8.0
pickleshare==0.7.5
Pillow @ file:///tmp/build/80754af9/pillow_1594307295532/work
plac==1.1.3
pluggy==0.13.1
pooch==1.1.1
pre-commit==2.7.1
preshed==3.0.2
prometheus-client==0.8.0
promise==2.3
prompt-toolkit==3.0.7
protobuf==3.14.0
psutil==5.7.3
ptyprocess==0.6.0
py==1.9.0
pyarrow==2.0.0
pycparser==2.20
Pygments==2.6.1
PyJWT==1.7.1
pylint @ file:///tmp/build/80754af9/pylint_1598623985952/work
pymdown-extensions==8.0
Pympler @ file:///tmp/build/80754af9/pympler_1602785470644/work
pyparsing==2.4.7
pyproj==3.0.0.post1
pyrsistent==0.16.0
pytest==6.0.1
pytest-cov==2.10.1
python-dateutil==2.8.1
python-slugify==4.0.1
pytorchvis==0.0.4
pytz==2020.1
PyYAML==5.3.1
pyzmq==19.0.2
qtconsole==4.7.6
QtPy==1.9.0
recommonmark==0.6.0
regex==2020.7.14
requests==2.24.0
requests-oauthlib==1.3.0
resampy==0.2.2
rouge-score==0.0.4
sacremoses==0.0.43
scikit-learn==0.23.2
scipy==1.5.2
seaborn==0.11.0
Send2Trash==1.5.0
sentencepiece==0.1.91
sentry-sdk==0.19.5
seqeval==1.2.2
Shapely==1.7.1
shortuuid==1.0.1
six==1.15.0
slugify==0.0.1
smmap==3.0.4
snoop==0.2.5
snowballstemmer==2.0.0
SoundFile==0.10.3.post1
spacy==2.3.2
Sphinx==3.2.1
sphinxcontrib-applehelp==1.0.2
sphinxcontrib-devhelp==1.0.2
sphinxcontrib-htmlhelp==1.0.3
sphinxcontrib-jsmath==1.0.1
sphinxcontrib-qthelp==1.0.3
sphinxcontrib-serializinghtml==1.1.4
SQLAlchemy==1.3.20
srsly==1.0.2
subprocess32==3.5.4
tables==3.6.1
tensor-sensor==0.1.1
terminado==0.8.3
testpath==0.4.4
text-unidecode==1.3
thinc==7.4.1
threadpoolctl==2.1.0
tokenizers==0.9.3
toml @ file:///tmp/build/80754af9/toml_1592853716807/work
torch==1.7.0
torchaudio==0.7.0a0+ac17b64
torchvision==0.8.1
torchviz==0.0.1
tornado==6.0.4
tqdm==4.48.2
traitlets==5.0.0
transformers==3.5.1
typed-ast==1.4.1
typing-extensions @ file:///tmp/build/80754af9/typing_extensions_1598376058250/work
urllib3==1.25.10
virtualenv==20.0.31
wandb==0.10.12
wasabi==0.8.0
watchdog==1.0.1
wcwidth==0.2.5
webencodings==0.5.1
Werkzeug==1.0.1
widgetsnbextension==3.5.1
wrapt==1.11.2
xxhash==2.0.0

Or with conda:

$ conda list --export 
# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: linux-64
_libgcc_mutex=0.1=main
absl-py=0.11.0=pypi_0
adal=1.2.4=pypi_0
alabaster=0.7.12=pypi_0
appdirs=1.4.4=pypi_0
argon2-cffi=20.1.0=pypi_0
astroid=2.4.2=py38_0
asttokens=2.0.4=pypi_0
attrs=20.1.0=pypi_0
audioread=2.1.8=pypi_0
azure-cognitiveservices-search-imagesearch=2.0.0=pypi_0
azure-common=1.1.25=pypi_0
babel=2.8.0=pypi_0
backcall=0.2.0=pypi_0
birdseye=0.8.4=pypi_0
black=20.8b1=pypi_0
blas=1.0=mkl
bleach=3.1.5=pypi_0
blis=0.4.1=pypi_0
ca-certificates=2020.10.14=0
cached-property=1.5.2=pypi_0
catalogue=1.0.0=pypi_0
certifi=2020.6.20=pyhd3eb1b0_3
cffi=1.14.2=pypi_0
cfgv=3.2.0=pypi_0
chardet=3.0.4=pypi_0
cheap-repr=0.4.4=pypi_0
click=7.1.2=pypi_0
click-plugins=1.1.1=pypi_0
cligj=0.7.1=pypi_0
colorednoise=1.1.1=pypi_0
commonmark=0.9.1=pypi_0
configparser=5.0.1=pypi_0
coverage=5.2.1=pypi_0
cryptography=3.1=pypi_0
cudatoolkit=11.0.221=h6bb024c_0
cycler=0.10.0=pypi_0
cymem=2.0.3=pypi_0
dataclasses=0.6=pypi_0
decorator=4.4.2=pypi_0
defusedxml=0.6.0=pypi_0
dill=0.3.3=pypi_0
distlib=0.3.1=pypi_0
docker-pycreds=0.4.0=pypi_0
docutils=0.16=pypi_0
einops=0.3.0=pypi_0
entrypoints=0.3=pypi_0
executing=0.5.3=pypi_0
fastai=2.1.8=dev_0
fastai-xla-extensions=0.0.1=dev_0
fastbook=0.0.11=pypi_0
fastcore=1.3.11=dev_0
fastprogress=1.0.0=pypi_0
fastscript=1.0.0=pypi_0
filelock=3.0.12=pypi_0
fiona=1.8.18=pypi_0
flask=1.1.2=pypi_0
flask-humanize=0.3.0=pypi_0
freetype=2.10.2=h5ab3b9f_0
future=0.18.2=pypi_0
geopandas=0.8.1=pypi_0
gitdb=4.0.5=pypi_0
gitpython=3.1.11=pypi_0
heartrate=0.2.1=pypi_0
humanize=3.1.0=pypi_0
identify=1.4.30=pypi_0
idna=2.10=pypi_0
imagesize=1.2.0=pypi_0
iniconfig=1.0.1=pypi_0
intel-openmp=2020.2=254
ipykernel=5.3.4=pypi_0
ipython=7.18.1=pypi_0
ipython-genutils=0.2.0=pypi_0
ipywidgets=7.5.1=pypi_0
isodate=0.6.0=pypi_0
isort=5.4.2=py38_0
itsdangerous=1.1.0=pypi_0
jedi=0.17.2=pypi_0
jinja2=2.11.2=pypi_0
joblib=0.16.0=pypi_0
jpeg=9b=h024ee3a_2
jsonschema=3.2.0=pypi_0
jupyter=1.0.0=pypi_0
jupyter-client=6.1.7=pypi_0
jupyter-console=6.2.0=pypi_0
jupyter-core=4.6.3=pypi_0
jupyter-notebook-gist=0.5.0=pypi_0
kaggle=1.5.9=pypi_0
kiwisolver=1.2.0=pypi_0
lazy-object-proxy=1.4.3=py38h7b6447c_0
lcms2=2.11=h396b838_0
ld_impl_linux-64=2.33.1=h53a641e_7
libedit=3.1.20191231=h14c3975_1
libffi=3.3=he6710b0_2
libgcc-ng=9.1.0=hdf63c60_0
libpng=1.6.37=hbc83047_0
librosa=0.8.0=pypi_0
libstdcxx-ng=9.1.0=hdf63c60_0
libtiff=4.1.0=h2733197_1
libuv=1.40.0=h7b6447c_0
littleutils=0.2.2=pypi_0
livereload=2.6.3=pypi_0
llvmlite=0.34.0=pypi_0
lunr=0.5.8=pypi_0
lz4-c=1.9.2=he6710b0_1
markdown=3.2.2=pypi_0
markupsafe=1.1.1=pypi_0
matplotlib=3.3.1=pypi_0
mccabe=0.6.1=py38_1
memory-profiler=0.58.0=pypi_0
mir-eval=0.6=pypi_0
mistune=0.8.4=pypi_0
mkautodoc=0.1.0=pypi_0
mkdocs=1.1.2=pypi_0
mkdocs-material=5.5.12=pypi_0
mkdocs-material-extensions=1.0=pypi_0
mkl=2020.2=256
mkl-service=2.3.0=py38he904b0f_0
mkl_fft=1.1.0=py38h23d657b_0
mkl_random=1.1.1=py38h0573a6f_0
mknotebooks=0.4.1=pypi_0
more-itertools=8.5.0=pypi_0
msrest=0.6.19=pypi_0
msrestazure=0.6.4=pypi_0
munch=2.5.0=pypi_0
murmurhash=1.0.2=pypi_0
mypy-extensions=0.4.3=pypi_0
nbconvert=5.6.1=pypi_0
nbdev=1.1.6=dev_0
nbformat=5.0.7=pypi_0
ncurses=6.2=he6710b0_1
ninja=1.10.0=py38hfd86e86_0
nlp=0.4.0=pypi_0
nltk=3.5=pypi_0
nodeenv=1.5.0=pypi_0
notebook=6.1.3=pypi_0
numba=0.51.2=pypi_0
numexpr=2.7.1=pypi_0
numpy=1.19.1=py38hbc911f0_0
numpy-base=1.19.1=py38hfa32c7d_0
oauthlib=3.1.0=pypi_0
ohmeow-blurr=0.0.18=pypi_0
olefile=0.46=py_0
openssl=1.1.1h=h7b6447c_0
outdated=0.2.0=pypi_0
packaging=20.4=pypi_0
pandas=1.1.1=pypi_0
pandocfilters=1.4.2=pypi_0
parso=0.7.1=pypi_0
pathspec=0.8.0=pypi_0
pexpect=4.8.0=pypi_0
pickleshare=0.7.5=pypi_0
pillow=7.2.0=py38hb39fc2d_0
pip=20.2.2=py38_0
plac=1.1.3=pypi_0
pluggy=0.13.1=pypi_0
pooch=1.1.1=pypi_0
pre-commit=2.7.1=pypi_0
preshed=3.0.2=pypi_0
prometheus-client=0.8.0=pypi_0
promise=2.3=pypi_0
prompt-toolkit=3.0.7=pypi_0
protobuf=3.14.0=pypi_0
psutil=5.7.3=pypi_0
ptyprocess=0.6.0=pypi_0
py=1.9.0=pypi_0
pyarrow=2.0.0=pypi_0
pycparser=2.20=pypi_0
pygments=2.6.1=pypi_0
pyjwt=1.7.1=pypi_0
pylint=2.6.0=py38_0
pymdown-extensions=8.0=pypi_0
pympler=0.9=py_0
pyparsing=2.4.7=pypi_0
pyproj=3.0.0.post1=pypi_0
pyrsistent=0.16.0=pypi_0
pytest=6.0.1=pypi_0
pytest-cov=2.10.1=pypi_0
python=3.8.5=hcff3b4d_0
python-dateutil=2.8.1=pypi_0
python-graphviz=0.14.1=pypi_0
python-slugify=4.0.1=pypi_0
pytorch=1.7.0=py3.8_cuda11.0.221_cudnn8.0.3_0
pytorchvis=0.0.4=pypi_0
pytz=2020.1=pypi_0
pyyaml=5.3.1=pypi_0
pyzmq=19.0.2=pypi_0
qtconsole=4.7.6=pypi_0
qtpy=1.9.0=pypi_0
readline=8.0=h7b6447c_0
recommonmark=0.6.0=pypi_0
regex=2020.7.14=pypi_0
requests=2.24.0=pypi_0
requests-oauthlib=1.3.0=pypi_0
resampy=0.2.2=pypi_0
rouge-score=0.0.4=pypi_0
sacremoses=0.0.43=pypi_0
scikit-learn=0.23.2=pypi_0
scipy=1.5.2=pypi_0
seaborn=0.11.0=pypi_0
send2trash=1.5.0=pypi_0
sentencepiece=0.1.91=pypi_0
sentry-sdk=0.19.5=pypi_0
seqeval=1.2.2=pypi_0
setuptools=49.6.0=py38_0
shapely=1.7.1=pypi_0
shortuuid=1.0.1=pypi_0
six=1.15.0=py_0
slugify=0.0.1=pypi_0
smmap=3.0.4=pypi_0
snoop=0.2.5=pypi_0
snowballstemmer=2.0.0=pypi_0
soundfile=0.10.3.post1=pypi_0
spacy=2.3.2=pypi_0
sphinx=3.2.1=pypi_0
sphinxcontrib-applehelp=1.0.2=pypi_0
sphinxcontrib-devhelp=1.0.2=pypi_0
sphinxcontrib-htmlhelp=1.0.3=pypi_0
sphinxcontrib-jsmath=1.0.1=pypi_0
sphinxcontrib-qthelp=1.0.3=pypi_0
sphinxcontrib-serializinghtml=1.1.4=pypi_0
sqlalchemy=1.3.20=pypi_0
sqlite=3.33.0=h62c20be_0
srsly=1.0.2=pypi_0
subprocess32=3.5.4=pypi_0
tables=3.6.1=pypi_0
tensor-sensor=0.1.1=pypi_0
terminado=0.8.3=pypi_0
testpath=0.4.4=pypi_0
text-unidecode=1.3=pypi_0
thinc=7.4.1=pypi_0
threadpoolctl=2.1.0=pypi_0
tk=8.6.10=hbc83047_0
tokenizers=0.9.3=pypi_0
toml=0.10.1=py_0
torchaudio=0.6.0=pypi_0
torchvision=0.8.1=py38_cu110
torchviz=0.0.1=pypi_0
tornado=6.0.4=pypi_0
tqdm=4.48.2=pypi_0
traitlets=5.0.0=pypi_0
transformers=3.5.1=pypi_0
typed-ast=1.4.1=pypi_0
typing_extensions=3.7.4.3=py_0
urllib3=1.25.10=pypi_0
virtualenv=20.0.31=pypi_0
wandb=0.10.12=pypi_0
wasabi=0.8.0=pypi_0
watchdog=1.0.1=pypi_0
wcwidth=0.2.5=pypi_0
webencodings=0.5.1=pypi_0
werkzeug=1.0.1=pypi_0
wheel=0.35.1=py_0
widgetsnbextension=3.5.1=pypi_0
wrapt=1.11.2=py38h7b6447c_0
xxhash=2.0.0=pypi_0
xz=5.2.5=h7b6447c_0
zlib=1.2.11=h7b6447c_3
zstd=1.4.5=h9ceee32_0

Show progress bar when processing text data

Currently, during the processing of the records means there is no insight into where int the dataset it is currently at, this can be improved by giving feedback on the current and total records its process.

Additional progress information will also help about the type of operation being performed.

Opened on the back discussions in #1

plt in colab notebook

The 'nlp_profiler/notebooks/google-colab/nlp_profiler.ipynb' notebook for Colab requires a matplotlib import.

Investigate and fix issue with not running on Windows-latest triggered via GitHub actions

Currently, the test-coverage.sh script does not seem to run or produce test coverage reports on the Windows instances when run via the GitHub instance, this isn't desired.

It would be good to do one of the following:

investigate why this does not work on Windows on Github actions
convert shell script into tox (using tox conventions) to make it uniform and consistent all across

Discovered when inspecting outcome of PR #19

Error related to parallelisation process when trying to using NLP Profiler

The below error was reported by @CarloLepelaars when using the NLP Profiler on a text dataset on a local machine environment with Anaconda (I have encountered a similar error as well when running NLP Profiler on Kaggle also with the Python environment set up by Anaconda).

Usage

df = apply_text_profiling(df, 'Text')

Output

Command:
```df = apply_text_profiling(df, 'Text')```


Full output:
final params: {'high_level': True, 'granular': True, 'grammar_check': False, 'spelling_check': True, 'parallelisation_method': 'default'}
Granular features: 0%
0/3 [00:01<?, ?it/s]
Granular features: Text => sentences_count: 0%
0/13 [00:01<?, ?it/s]
sentences_count: 32%
32/100 [00:20<00:01, 38.40it/s]


---------------------------------------------------------------------------
_RemoteTraceback Traceback (most recent call last)
_RemoteTraceback:
'''
Traceback (most recent call last):
File "/opt/anaconda3/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 391, in _process_worker
call_item = call_queue.get(block=True, timeout=timeout)
File "/opt/anaconda3/lib/python3.7/multiprocessing/queues.py", line 113, in get
return _ForkingPickler.loads(res)
File "/opt/anaconda3/lib/python3.7/site-packages/nlp_profiler/generate_features.py", line 5, in <module>
import swifter # noqa
File "/opt/anaconda3/lib/python3.7/site-packages/swifter/__init__.py", line 5, in <module>
from .swifter import SeriesAccessor, DataFrameAccessor
File "/opt/anaconda3/lib/python3.7/site-packages/swifter/swifter.py", line 14, in <module>
from .base import (
File "/opt/anaconda3/lib/python3.7/site-packages/swifter/base.py", line 4, in <module>
from psutil import cpu_count, virtual_memory
File "/opt/anaconda3/lib/python3.7/site-packages/psutil/__init__.py", line 159, in <module>
from . import _psosx as _psplatform
File "/opt/anaconda3/lib/python3.7/site-packages/psutil/_psosx.py", line 15, in <module>
from . import _psutil_osx as cext
ImportError: dlopen(/opt/anaconda3/lib/python3.7/site-packages/psutil/_psutil_osx.cpython-37m-darwin.so, 2): Symbol not found: ___CFConstantStringClassReference
Referenced from: /opt/anaconda3/lib/python3.7/site-packages/psutil/_psutil_osx.cpython-37m-darwin.so
Expected in: flat namespace
in /opt/anaconda3/lib/python3.7/site-packages/psutil/_psutil_osx.cpython-37m-darwin.so
'''

The above exception was the direct cause of the following exception:

BrokenProcessPool Traceback (most recent call last)
<ipython-input-24-96bf1218f0a1> in <module>
----> 1 df = apply_text_profiling(df, 'Text')

/opt/anaconda3/lib/python3.7/site-packages/nlp_profiler/core.py in apply_text_profiling(dataframe, text_column, params)
64 action_function(
65 action_description, new_dataframe,
---> 66 text_column, default_params[PARALLELISATION_METHOD_OPTION]
67 )
68

/opt/anaconda3/lib/python3.7/site-packages/nlp_profiler/granular_features.py in apply_granular_features(heading, new_dataframe, text_column, parallelisation_method)
45 generate_features(
46 heading, granular_features_steps,
---> 47 new_dataframe, parallelisation_method
48 )

/opt/anaconda3/lib/python3.7/site-packages/nlp_profiler/generate_features.py in generate_features(main_header, high_level_features_steps, new_dataframe, parallelisation_method)
45 new_dataframe[new_column] = parallelisation_method_function(
46 source_field, transformation_function,
---> 47 source_column, new_column
48 )
49

/opt/anaconda3/lib/python3.7/site-packages/nlp_profiler/generate_features.py in using_joblib_parallel(source_field, apply_function, source_column, new_column)
65 delayed(run_task)(
66 apply_function, each_value
---> 67 ) for _, each_value in enumerate(source_values_to_transform)
68 )
69 source_values_to_transform.update()

/opt/anaconda3/lib/python3.7/site-packages/joblib/parallel.py in __call__(self, iterable)
1015
1016 with self._backend.retrieval_context():
-> 1017 self.retrieve()
1018 # Make sure that we get a last message telling us we are done
1019 elapsed_time = time.time() - self._start_time

/opt/anaconda3/lib/python3.7/site-packages/joblib/parallel.py in retrieve(self)
907 try:
908 if getattr(self._backend, 'supports_timeout', False):
--> 909 self._output.extend(job.get(timeout=self.timeout))
910 else:
911 self._output.extend(job.get())

/opt/anaconda3/lib/python3.7/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
560 AsyncResults.get from multiprocessing."""
561 try:
--> 562 return future.result(timeout=timeout)
563 except LokyTimeoutError:
564 raise TimeoutError()

/opt/anaconda3/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
433 raise CancelledError()
434 elif self._state == FINISHED:
--> 435 return self.__get_result()
436 else:
437 raise TimeoutError()

/opt/anaconda3/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
382 def __get_result(self):
383 if self._exception:
--> 384 raise self._exception
385 else:
386 return self._result

BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.

Suggested workaround

Use NLP Profiler with the following parameters instead:

df = apply_text_profiling(df, 'Text',  params={'parallelisation_method': 'using_swifter'})

Suggested solution to issue

Surround the functionality (i.e. apply_text_profiling method) around a try...except block
Capture the details of the error (like the block in the above section) in a log file and write to disk
Print an informative message to the user:
- letting them know of the error (a single line error message)
- also mention in the logs, where to raise an issue
- where to find the details of the issue (location to the log file)
- suggest a workaround (see above section for a workaround).

Thanks for sharing the issue with us Carlo.

[BUG] Not all granular features are getting generated

Describe the bug

After running the notebook(s) on Kaggle/local machine we can see that not all granular features are getting generated for e.g. these fields 'repeated_letters_count', 'repeated_digits_count', 'repeated_spaces_count', 'repeated_whitespaces_count',
'repeated_punctuations_count', 'english_characters_count', 'non_english_characters_count' in addition to the others are not part of the dataframe, either it's not detected or something else is amiss.

To Reproduce

Run the notebook on Kaggle i.e. https://www.kaggle.com/code/neomatrix369/nlp-profiler-simple-dataset and it fails at the cell that looks for repeat characters, etc...

Version information:

NLP Profiler Version 0.0.3 - issue is not relevant to environment or any other technical parameter.
The version on the master branch also behaves in the same manner.

Additional context

From the logs on https://www.kaggle.com/code/neomatrix369/nlp-profiler-simple-dataset#Installation-and-import-libraries/packages - the 0.0.3 version on PyPi worked in the past and for some time has not been working.

'lineprofiler' does not install on Windows

As per this issue on lineprofiler's repo pyutils/line_profiler#28, there is an issue when trying to install lineprofiler using the standard method:

pip install line-profiler==3.0.2

Workaround
An alternative method has been mentioned here pyutils/line_profiler#28 (comment)

Note: discovered during creation of Github CI/CD action see #19

Suggest to loosen the dependency on textblob

Hi, your project nlp_profiler(commit id: bde13ee) requires "textblob==0.15.3" in its dependency. After analyzing the source code, we found that the following versions of textblob can also be suitable, i.e., textblob 0.9.0, 0.9.1, 0.10.0, 0.11.0, 0.11.1, 0.12.0, 0.13.0, 0.13.1, 0.14.0, 0.15.0, 0.15.1, 0.15.2, since all functions that you directly (2 APIs: textblob.blob.Word.new, textblob.blob.TextBlob.init) or indirectly (propagate to 5 textblob's internal APIs and 0 outsider APIs) used from the package have not been changed in these versions, thus not affecting your usage.

Therefore, we believe that it is quite safe to loose your dependency on textblob from "textblob==0.15.3" to "textblob>=0.9.0,<=0.15.3". This will improve the applicability of nlp_profiler and reduce the possibility of any further dependency conflict with other projects.

May I pull a request to further loosen the dependency on textblob?

By the way, could you please tell us whether such an automatic tool for dependency analysis may be potentially helpful for maintaining dependencies easier during your development?

Improve the progress bars when profiling datasets

We have progress bars shown for each level of progress while datasets are profile, except that now we see a lot of them.

Reducing the details and just having a few that show all the stats would be helpful from UX/UI point of view.

Related to #3

Suggestions to potential solution

(option 1: ideal solution): show all 3 levels of progress bars and only 3 progress bars which dynamically change and show progress
- currently, the 2nd and 3rd level progress bars are getting re-created (and appended to the visual cue), we would like this to not happen
(option 2: easy) disable the 3 level (row-level) progress bar
- hide this behind a toggle switch that can be passed as a param, by default the parameter switches off the progress bar
(option 3: half-way): show all the details in two levels of progress bar which change dynamically and do not get re-created for each iteration

How do we know it works well and there is no regress?

run the notebooks in the notebooks folder to see if the changes are working as expected
run the test coverage shell script to see tests pass 100% and coverage is 100%

[BUG] Improving/changing the spell checker leads to tests breaking, implementation changing

Describe the bug

On the back of the PR #71 which was reverted from master due to breaking changes, would be best to review, and fix either implement/tests or both and merge to master.

To Reproduce

Clone the branch related to PR #71 and run all the tests.

Version information:

Version 0.0.3 - no other aspects are linked to it

Additional context

The steps would be following in order to apply and test the improvements to the spell checker:

get latest changes on local master
branch from master for spell checker changes
run all the tests with ./test-coverage.sh tests slow-tests
apply the spell checker changes as applied previously via PR #71
run all the tests with ./test-coverage.sh tests slow-tests
fix tests such that the tests are passing for the right reasons
create Pull Request and follow guidelines (GA on PR should indicate if the changes are good or not)

PR #71 was reverted via PR #75.

Able to run at scale: handle larger datasets

At the moment the library runs slow and takes a long time to handle large datasets due to the processing require per record, this could be optimised and improved in small steps to be able to handle larger datasets

Opened on the back discussions in #1. Partially related to #3 although independent of the issue.

Not available on Pypi

This package is not available on PyPi, plans on when this is going to happen?

Also provide ways to install directly with pipenv

[FEATURE] High level topic modelling

Would like to discuss if a high level topic modelling method would be of functional use. If yes, I would also like to discuss potential methods for the same.

Improve logic behind spell checking text

Core issue
We have a spell checking functionality in NLP Profiler which uses a third-party library i.e. TextBlob, it does a decent job although the scores returned per misspelt word would then need to be correctly amortised across the whole text.

Meaning, in a fair fashion evaluate on the whole how bad is the spelling in the text.

At the moment it's using the below logic:

def spelling_quality_score(text: str) -> float:
    if (not isinstance(text, str)) or (len(text.strip()) == 0):
        return NaN

    tokenized_text = get_tokenized_text(text)
    misspelt_words = [
        each_word for _, each_word in enumerate(tokenized_text)
        if actual_spell_check(each_word) is not None
    ]
    avg_words_per_sentence = \
        len(tokenized_text) / get_sentence_count(text)
    result = 1 - (len(misspelt_words) / avg_words_per_sentence)

    return result if result >= 0.0 else 0.0

Which can be improved as there are visible chances of false positive or false negative scores.

PS: performance of this feature is being addressed on #2, so this particular issue isn't about improving it's speed/performance. Performance issues may be addressed via other issues at a later stage. There has already been some significant performance improvements to the spell check and other aspects of NLP Profiler via #2.

Fix to #14 impacts, this issue, will need to also be fixed together.

Secondary issue

~~Replace the spellchecker with the package pyspellchecker (on PyPi) which appears to be closer to Peter Norvig's work.~~ Replaced with Symspellpy (https://pypi.org/project/symspellpy/)

[FEATURE] Test Package is installing on Python 3.9 via GitHub Actions

Missing functionality

At the moment we are not sure if the NLP Profiler will work perfectly fine using python 3.9 - and it would be good to have support for 3.9 to start with as we have phased out 3.6 (due to one or more dependencies not supporting Python 3.6).

Proposed feature

Create a Github action directive (pipeline) for Python 3.9 just like we have for Python 3.7 and 3.8.

Alternatives considered
N/A

Additional context
N/A

Collecting git+https://github.com/neomatrix369/nlp_profiler.git@master
  Cloning https://github.com/neomatrix369/nlp_profiler.git (to revision master) to /tmp/pip-req-build-mggu48uy
Once successfully installed, please restart your Jupyter kernels for the changes to take effect
  Running command git clone -q https://github.com/neomatrix369/nlp_profiler.git /tmp/pip-req-build-mggu48uy
ERROR: Package 'nlp-profiler' requires a different Python: 3.6.9 not in '>=3.7.0'

The package requires python version to be 3.7
My local python version is 3.6.8 and colab python version is 3.6.9
Therefore am getting this error.

neomatrix369 / nlp_profiler Goto Github PK

nlp_profiler's Introduction

NLP Profiler

Table of contents

What do you get from the library?

Requirements

Getting started

Installation

Usage

Developer guide

Demo and presentations

Notebooks

Screenshots

Credits and supporters

Changes

License

Contributing

nlp_profiler's People

Contributors

Stargazers

Watchers

Forkers

nlp_profiler's Issues

Recommend Projects

Recommend Topics

Recommend Org