Textpipe: clean and extract metadata from text

License: MIT License

Languages: Python 99.42%, Shell 0.58%
Topics: language-identification, named-entities, named-entity-recognition, nlp, text-analysis, text-processing

textpipe's Introduction

THIS REPOSITORY IS NO LONGER MAINTAINED

textpipe: clean and extract metadata from text

textpipe is a Python package for converting raw text into clean, readable text and extracting metadata from that text. Its functionality includes removing HTML tags to make raw text readable and extracting metadata such as the number of words and the named entities in the text.

Vision: the zen of textpipe

  • Designed for use in production pipelines without adult supervision.
  • Rechargeable batteries included: provide sane defaults and clear examples to adapt.
  • A uniform interface with thin wrappers around state-of-the-art NLP packages.
  • As language-agnostic as possible.
  • Bring your own models.

Features

  • Clean raw text by removing HTML and other unreadable constructs
  • Identify the language of text
  • Extract the number of words, the number of sentences, and the named entities from a text
  • Calculate the complexity of a text
  • Obtain text metadata by specifying a pipeline containing all desired elements
  • Obtain sentiment (polarity and a subjectivity score)
  • Generate word counts
  • Compute a MinHash for cheap similarity estimation of documents
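
To illustrate the last feature, here is a minimal sketch of the MinHash idea using only the standard library. It is not textpipe's implementation: the function names `minhash_signature` and `estimate_jaccard`, the use of salted BLAKE2b as the hash family, and the 64-slot signature are all assumptions made for illustration.

```python
import hashlib


def minhash_signature(tokens, num_perm=64):
    """Return a MinHash signature: one minimum hash value per salted hash function."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")  # a different salt stands in for a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(token.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for token in tokens
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


doc_a = set("the quick brown fox jumps over the lazy dog".split())
doc_b = set("the quick brown fox sleeps under the lazy dog".split())
similarity = estimate_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
print(f"estimated similarity: {similarity:.2f}")
```

The two word sets share 6 of 10 distinct words, so the exact Jaccard similarity is 0.6; the printed estimate approximates that value, and a longer signature tightens the approximation at the cost of more hashing.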

Installation

It is recommended that you install textpipe using a virtual environment.

  • Using venv.
python3 -m venv .venv
  • Using virtualenv.
virtualenv venv -p python3.6
  • Using virtualenvwrapper.
mkvirtualenv textpipe -p python3.6
  • Install textpipe using pip.
pip install textpipe
  • Install the required packages using requirements.txt.
pip install -r requirements.txt

A note on spaCy download model requirement

While the requirements.txt file that comes with the package calls for spaCy's en_core_web_sm model, this can be changed depending on the model and language you require for your intended use. See spaCy.io's page on their different models for more information.

Usage example

>>> from textpipe import doc, pipeline
>>> sample_text = 'Sample text! <!DOCTYPE>'
>>> document = doc.Doc(sample_text)
>>> print(document.clean)
Sample text!
>>> print(document.language)
en
>>> print(document.nwords)
3

>>> pipe = pipeline.Pipeline(['CleanText', 'NWords'])
>>> print(pipe(sample_text))
{'CleanText': 'Sample text!', 'NWords': 3}

To extend the existing textpipe operations with your own operations:

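from textpipe import pipeline

test_pipe = pipeline.Pipeline(['CleanText', 'NWords'])

# A custom operation takes the document plus optional context and settings.
def custom_op(doc, context=None, settings=None, **kwargs):
    return 1

# Register the operation under a name, then append it as a (name, arguments) step.
custom_argument = {'argument': 1}
test_pipe.register_operation('CUSTOM_STEP', custom_op)
test_pipe.steps.append(('CUSTOM_STEP', custom_argument))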

Contributing

See CONTRIBUTING for guidelines for contributors.

Changes

0.12.1

  • Bumps redis, tqdm, pylint

0.12.0

  • Bumps versions of many dependencies including textacy. Results for keyterm extraction changed.

0.11.9

  • Exposes arbitrary spaCy ents properties

0.11.8

  • Exposes SpaCy's cats attribute

0.11.7

  • Bumps spaCy and redis versions

0.11.6

  • Fixes bug where gensim model is not cached in pipeline

0.11.5

  • Raise TextpipeMissingModelException instead of KeyError

0.11.4

  • Bumps spaCy and datasketch dependencies

0.11.1

  • Replaces codacy with pylint on CI
  • Fixes pylint issues

0.11.0

  • Adds wrapper around Gensim keyed vectors to construct document embeddings from Redis cache

0.9.0

  • Adds functionality to compute document embeddings using a Gensim word2vec model

0.8.6

  • Removes non-standard UTF characters before detecting language

0.8.5

  • Bump spaCy to 2.1.3

0.8.4

  • Fix broken install command

0.8.3

  • Fix broken install command

0.8.2

  • Fix copy-paste error in word vector aggregation (#118)

0.8.1

  • Fixes bugs in several operations that didn't accept kwargs

0.8.0

  • Bumps spaCy to 2.1

0.7.2

  • Pins spaCy and Pattern versions (with pinned lxml)

0.7.0

  • change operation's registry from list to dict
  • global pipeline data is available across operations via the context kwarg
  • load custom operations using register_operation in pipeline
  • custom steps (operations) with arguments

textpipe's People

Contributors

anneschuth, bartdegoede, bcambel, dependabot-preview[bot], dependabot-support, dodijk, graus, jinxcat, joostdewit, kgbicheno, lmdehaas, marliesvanderwees, tanjacrijns, vascovisser

textpipe's Issues

Sync local and remote code linting and unit tests.

The behavior of script/test should be in sync with the codacy/travis tests. I want to run script/test in a pre-commit hook: if the script fails, codacy/travis should also fail were the commit pushed, and if the script passes, codacy/travis should also pass. I don't want to commit or push anything that won't pass testing upstream.

Potential dependency conflicts between textpipe and numpy

Hi, as shown in the full dependency graph of textpipe below, textpipe requires numpy <1.19,>=1.18.0, textacy requires pyemd >=0.5.0 (pyemd 0.5.1 will be installed, i.e., the newest version satisfying the version constraint), and the dependency pyemd 0.5.1 transitively introduces numpy <2.0.0,>=1.9.0.

There are therefore multiple version constraints set for numpy in this project. However, according to pip's “first found wins” installation strategy, numpy 1.18.4 (i.e., the newest version satisfying the constraint <1.19,>=1.18.0) is the version actually installed.

Although the first-found version, numpy 1.18.4, just satisfies the later dependency constraint (numpy <1.19,>=1.18.0), this installed version is very close to the upper bound of the numpy version constraint specified by pyemd 0.5.1.

Once pyemd upgrades, its newest version will be installed, because textpipe does not specify an upper bound on the version constraint for pyemd. This can easily cause a dependency conflict (build failure) if the upgraded pyemd version introduces a higher version of numpy that violates the other version constraint, <1.19,>=1.18.0.

According to the release history of pyemd, it habitually changes its NumPy constraint in recent releases. For instance, pyemd 0.4.0 changed the NumPy constraint from >=1.8.0, <2.0.0 to >=1.10.0, <2.0.0, and pyemd 0.4.2 changed it from >=1.10.0, <2.0.0 to >=1.9.0, <2.0.0.

As such, this is a friendly warning of a potential dependency conflict for textpipe.
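
The reasoning above can be checked with a small sketch that tests one installed version against several constraint strings. It is a simplified illustration, not pip's resolver: `parse_version` and `satisfies` are hypothetical helpers that handle only purely numeric versions, and the last constraint is an invented future pyemd requirement, not a real release.

```python
import operator

OPERATORS = {">=": operator.ge, "<=": operator.le, "==": operator.eq,
             "<": operator.lt, ">": operator.gt}


def parse_version(text):
    """Turn '1.18.4' into (1, 18, 4), padding short versions such as '1.19' with zeros."""
    parts = tuple(int(part) for part in text.strip().split("."))
    return parts + (0,) * (3 - len(parts))


def satisfies(version, spec):
    """Check a version against a comma-separated spec such as '<1.19,>=1.18.0'."""
    candidate = parse_version(version)
    for clause in spec.split(","):
        clause = clause.strip()
        # Try two-character operators first so '>=' is not read as '>'.
        symbol = next(s for s in (">=", "<=", "==", "<", ">") if clause.startswith(s))
        if not OPERATORS[symbol](candidate, parse_version(clause[len(symbol):])):
            return False
    return True


constraints = {
    "textpipe": "<1.19,>=1.18.0",
    "pyemd 0.5.1": "<2.0.0,>=1.9.0",
}
installed = "1.18.4"
print(all(satisfies(installed, spec) for spec in constraints.values()))  # True

# Hypothetical future pyemd release that raises its lower bound (not a real release).
print(satisfies(installed, "<2.0.0,>=1.19.0"))  # False
```

The second check shows the failure mode the report warns about: the version pinned by textpipe's own range no longer satisfies the transitive requirement.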

Dependency tree

textpipe - 0.11.10
| +- beautifulsoup4(install version:4.9.1 version range:<5,>=4.8)
| +- cld2-cffi(install version:0.1.4 version range:<1,>=0.1)
| | +- cffi(install version:1.14.0 version range:*)
| | +- six(install version:1.14.0 version range:*)
| +- datasketch(install version:1.5.1 version range:<1.6,>=1.5.0)
| | +- numpy(install version:1.18.4 version range:>=1.11)
| +- gensim(install version:3.8.3 version range:<3.9,>=3.8.1)
| +- msgpack(install version:0.6.2 version range:<1,>=0.6)
| +- numpy(install version:1.18.4 version range:<1.19,>=1.18.0)
| +- redis(install version:3.4.1 version range:==3.4.1)
| +- spacy(install version: version range:<.3,>=2.2.3)
| +- textacy(install version:0.9.1 version range:<0.10,>=0.9.1)
| | +- cachetools(install version:4.1.0 version range:>=2.0.1)
| | +- cytoolz(install version:0.10.1 version range:>=0.8.0)
| | | +- toolz(install version:0.10.0 version range:>=0.8.0)
| | +- jellyfish(install version: version range:>=0.7.0)
| | +- joblib(install version:0.14.1 version range:>=0.13.0)
| | +- networkx(install version:2.4 version range:>=2.0)
| | | +- decorator(install version:4.4.2 version range:>=4.3.0)
| | +- numpy(install version:1.18.4 version range:>=1.17.0)
| | +- pyemd(install version:0.5.1 version range:>=0.5.0)
| | | +- numpy(install version:1.18.4 version range:<2.0.0,>=1.9.0)
| | +- pyphen(install version:0.9.5 version range:>=0.9.4)
| | +- requests(install version:2.23.0 version range:>=2.10.0)
| | | +- certifi(install version:2020.4.5.1 version range:>=2017.4.17)
| | | +- chardet(install version:3.0.4 version range:>=3.0.2,<4)
| | | +- idna(install version:2.9 version range:>=2.5,<3)
| | | +- urllib3(install version:1.25.9 version range:>=1.21.1,<1.26)
| | +- scikit-learn(install version:0.22.2.post1 version range:>=0.19.0)
| | +- scipy(install version:1.2.3 version range:>=0.17.0)
| | +- spacy(install version:2.2.4 version range:>=2.0.12)
| | +- srsly(install version:2.0.1 version range:>=0.0.5)
| | +- tqdm(install version:4.41.1 version range:>=4.19.6)
| +- textpipe-pattern(install version:3.6.1 version range:==3.6.1)
| +- tqdm(install version:4.41.1 version range:<4.42,>=4.41.1)

Thanks for your help.
Best,
Neolith

Implement pipeline as a list of tuples

Because each step also needs associated parameters, the current definition format is insufficient.

Using a dictionary or a list of dictionaries is deemed too verbose. Using tuples of size 2, where the first element is the name of the processor and the second element is a dictionary of the relevant parameters, seems to be an acceptable approach.

To decrease verbosity, the second-element dictionary should be defined beforehand.
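
A minimal sketch of the proposed format, assuming nothing about textpipe's internals: `clean_text`, `n_words`, the `OPERATIONS` registry, the `min_length` parameter, and `run` are hypothetical stand-ins used only to show how a list of (name, parameters) tuples could be consumed.

```python
def clean_text(text, strip_chars=None):
    """Stand-in operation: strip leading and trailing characters."""
    return text.strip(strip_chars)


def n_words(text, min_length=1):
    """Stand-in operation: count whitespace-separated tokens of a minimum length."""
    return sum(1 for token in text.split() if len(token) >= min_length)


# Hypothetical registry mapping step names to callables.
OPERATIONS = {"CleanText": clean_text, "NWords": n_words}

# Parameter dictionaries are defined beforehand to keep the step list short.
nwords_params = {"min_length": 2}
steps = [
    ("CleanText", {}),
    ("NWords", nwords_params),
]


def run(steps, text):
    """Apply each (name, params) step to the text and collect results by step name."""
    return {name: OPERATIONS[name](text, **params) for name, params in steps}


print(run(steps, "  Sample text !  "))
# {'CleanText': 'Sample text !', 'NWords': 2}
```

Keeping the parameters in a plain dictionary means each step can be unpacked with `**params`, so operations need no knowledge of the pipeline format.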

Add Installation instructions

It would be handy to add installation instructions to the main README.md page, such as:

Installation

It's highly suggested to create a virtual environment to deal with the site-packages.
Please use virtualenvwrapper or virtualenv

virtualenv venv -p python3.6
# or 
mkvirtualenv textpipe -p python3.6
# install the packages
pip install -r requirements.txt

Gensim models are not cached in pipeline

What I see in the debugger, is

I don't see Doc._gensim_vectors being passed back to the pipeline object, so this keeps repeating at each loop iteration.

test_entities() only tests `None == None`

The following test is currently in test_doc.py:

def test_entities():
    assert DOC_1.ents.sort() == ['Google'].sort()
    assert DOC_2.ents.sort() == ['Textmining', 'Concreet', 'Philips'].sort()
    assert DOC_3.ents == []

As ents.sort() does not return anything, this should be replaced with sorted(ents). The current test should not pass but does so as it only tests None == None.
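
The difference can be shown in isolation with plain lists. This is a sketch only: `same_entities` is a hypothetical helper, and the lists stand in for the `ents` values of the test documents.

```python
entities = ['Textmining', 'Concreet', 'Philips']

# list.sort() sorts in place and returns None, so this comparison is None == None.
assert ['Google'].sort() == ['Something else'].sort()

# sorted() returns a new list, so the comparison inspects the actual contents.
assert sorted(entities) == sorted(['Philips', 'Textmining', 'Concreet'])
assert sorted(['Google']) != sorted(['Something else'])


def same_entities(actual, expected):
    """Hypothetical helper: compare two entity lists while ignoring order."""
    return sorted(actual) == sorted(expected)


print(same_entities(entities, ['Concreet', 'Philips', 'Textmining']))  # True
```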

Usage example is actually wrong

I tested the Usage example with

sample_text = 'Sample text! <!DOCTYPE>'

The 'NWords' value is actually 3, because '!' is counted as a word.

spacy 3.0

Hello,
is it possible to make textpipe and textpipe-pattern compatible with spacy 3.0?
Thank you

Setup CI

Currently running on an internal Jenkins instance. Maybe transfer to Travis?

Tests run with:

# Install cld2-cffi package with an additional CFLAG
CFLAGS="-Wno-narrowing" pip3 install --quiet cld2-cffi

# Install requirements
pip3 install --quiet --requirement requirements.txt

# Install and run pylint
pip3 install --quiet --upgrade pylint 
python3 -m pylint -f parseable textpipe | tee pylint.out

# Download languages needed for testing
python3 -m spacy download nl > /dev/null
python3 -m spacy download en > /dev/null

# Run test with pytest, including doctests
pip3 install --quiet --upgrade pytest
python3 -m pytest --doctest-modules --junit-xml=pytest.xml

# Run unit tests again, but with nose for coverage report
pip3 install --quiet --upgrade nosexcover
python3 -m nose --with-doctest --with-xcoverage --with-xunit --cover-package=textpipe --cover-erase 2> /dev/null

distutils.errors.CompileError when installing textpipe

I get the following error when I try to install textpipe using Python 3.6.8 and Ubuntu 18.04.

Any thoughts? I can't seem to find any helpful solutions on the interwebs.

Traceback (most recent call last):
  File "/usr/lib/python3.6/distutils/unixccompiler.py", line 118, in _compile
    extra_postargs)
  File "/usr/lib/python3.6/distutils/ccompiler.py", line 909, in spawn
    spawn(cmd, dry_run=self.dry_run)
  File "/usr/lib/python3.6/distutils/spawn.py", line 36, in spawn
    _spawn_posix(cmd, search_path, dry_run=dry_run)
  File "/usr/lib/python3.6/distutils/spawn.py", line 159, in _spawn_posix
    % (cmd, exit_status))
distutils.errors.DistutilsExecError: command 'x86_64-linux-gnu-gcc' failed with exit status 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/pip-build-8bvewwgk/cld2-cffi/.eggs/cffi-1.12.3-py3.6-linux-x86_64.egg/cffi/ffiplatform.py", line 51, in _build
    dist.run_command('build_ext')
  File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
    cmd_obj.run()
  File "/usr/lib/python3/dist-packages/setuptools/command/build_ext.py", line 78, in run
    _build_ext.run(self)
  File "/usr/lib/python3.6/distutils/command/build_ext.py", line 339, in run
    self.build_extensions()
  File "/usr/lib/python3.6/distutils/command/build_ext.py", line 448, in build_extensions
    self._build_extensions_serial()
  File "/usr/lib/python3.6/distutils/command/build_ext.py", line 473, in _build_extensions_serial
    self.build_extension(ext)
  File "/usr/lib/python3/dist-packages/setuptools/command/build_ext.py", line 199, in build_extension
    _build_ext.build_extension(self, ext)
  File "/usr/lib/python3.6/distutils/command/build_ext.py", line 533, in build_extension
    depends=ext.depends)
  File "/usr/lib/python3.6/distutils/ccompiler.py", line 574, in compile
    self._compile(obj, src, ext, cc_args, extra_postargs, pp_opts)
  File "/usr/lib/python3.6/distutils/unixccompiler.py", line 120, in _compile
    raise CompileError(msg)
distutils.errors.CompileError: command 'x86_64-linux-gnu-gcc' failed with exit status 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/pip-build-8bvewwgk/cld2-cffi/setup.py", line 191, in <module>
    'Topic :: Text Processing :: Linguistic'
  File "/usr/lib/python3/dist-packages/setuptools/__init__.py", line 129, in setup
    return distutils.core.setup(**attrs)
  File "/usr/lib/python3.6/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/usr/lib/python3.6/distutils/dist.py", line 955, in run_commands
    self.run_command(cmd)
  File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
    cmd_obj.run()
  File "/usr/lib/python3/dist-packages/setuptools/command/egg_info.py", line 278, in run
    self.find_sources()
  File "/usr/lib/python3/dist-packages/setuptools/command/egg_info.py", line 293, in find_sources
    mm.run()
  File "/usr/lib/python3/dist-packages/setuptools/command/egg_info.py", line 524, in run
    self.add_defaults()
  File "/usr/lib/python3/dist-packages/setuptools/command/egg_info.py", line 560, in add_defaults
    sdist.add_defaults(self)
  File "/usr/lib/python3/dist-packages/setuptools/command/py36compat.py", line 34, in add_defaults
    self._add_defaults_python()
  File "/usr/lib/python3/dist-packages/setuptools/command/sdist.py", line 127, in _add_defaults_python
    build_py = self.get_finalized_command('build_py')
  File "/usr/lib/python3.6/distutils/cmd.py", line 299, in get_finalized_command
    cmd_obj.ensure_finalized()
  File "/usr/lib/python3.6/distutils/cmd.py", line 107, in ensure_finalized
    self.finalize_options()
  File "/usr/lib/python3/dist-packages/setuptools/command/build_py.py", line 34, in finalize_options
    orig.build_py.finalize_options(self)
  File "/usr/lib/python3.6/distutils/command/build_py.py", line 45, in finalize_options
    ('force', 'force'))
  File "/usr/lib/python3.6/distutils/cmd.py", line 287, in set_undefined_options
    src_cmd_obj.ensure_finalized()
  File "/usr/lib/python3.6/distutils/cmd.py", line 107, in ensure_finalized
    self.finalize_options()
  File "/tmp/pip-build-8bvewwgk/cld2-cffi/setup.py", line 143, in finalize_options
    self.distribution.ext_modules = get_ext_modules()
  File "/tmp/pip-build-8bvewwgk/cld2-cffi/setup.py", line 128, in get_ext_modules
    import cld2
  File "/tmp/pip-build-8bvewwgk/cld2-cffi/cld2/__init__.py", line 190, in <module>
    extra_compile_args=_COMPILER_ARGS)
  File "/tmp/pip-build-8bvewwgk/cld2-cffi/.eggs/cffi-1.12.3-py3.6-linux-x86_64.egg/cffi/api.py", line 464, in verify
    lib = self.verifier.load_library()
  File "/tmp/pip-build-8bvewwgk/cld2-cffi/.eggs/cffi-1.12.3-py3.6-linux-x86_64.egg/cffi/verifier.py", line 104, in load_library
    self._compile_module()
  File "/tmp/pip-build-8bvewwgk/cld2-cffi/.eggs/cffi-1.12.3-py3.6-linux-x86_64.egg/cffi/verifier.py", line 201, in _compile_module
    outputfilename = ffiplatform.compile(tmpdir, self.get_extension())
  File "/tmp/pip-build-8bvewwgk/cld2-cffi/.eggs/cffi-1.12.3-py3.6-linux-x86_64.egg/cffi/ffiplatform.py", line 22, in compile
    outputfilename = _build(tmpdir, ext, compiler_verbose, debug)
  File "/tmp/pip-build-8bvewwgk/cld2-cffi/.eggs/cffi-1.12.3-py3.6-linux-x86_64.egg/cffi/ffiplatform.py", line 58, in _build
    raise VerificationError('%s: %s' % (e.__class__.__name__, e))
cffi.VerificationError: CompileError: command 'x86_64-linux-gnu-gcc' failed with exit status 1

----------------------------------------

Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-8bvewwgk/cld2-cffi/

TypeError while installing through pip

pip install textpipe
Collecting textpipe
  Downloading https://files.pythonhosted.org/packages/44/0c/e7fafbda3caa0c7f8d6ffbe144d081a596c80d834e17943bbe0f5b31b8e9/textpipe-0.8.1.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-amdf0b/textpipe/setup.py", line 13, in <module>
        with open(Path(__file__).resolve().parent.joinpath('README.md'), 'r') as fh:
    TypeError: coercing to Unicode: need string or buffer, PosixPath found
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-amdf0b/textpipe/
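
The wording "coercing to Unicode: need string or buffer" is how Python 2 reports a non-string argument to `open()`, which suggests pip was running under a Python 2 interpreter here. One possible workaround for the failing line, sketched under that assumption, is to convert the path to a plain string before opening it. `read_readme` is a hypothetical helper, not code from textpipe's setup.py, and this change alone would not make the package itself run on Python 2.

```python
import os
import tempfile
from pathlib import Path


def read_readme(directory):
    """Read README.md from a directory, passing open() a plain string path."""
    readme_path = Path(directory).resolve().joinpath('README.md')
    # str() keeps the call valid on interpreters whose open() rejects Path objects.
    with open(str(readme_path), 'r') as fh:
        return fh.read()


# Demonstration with a temporary directory standing in for the package root.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, 'README.md'), 'w') as fh:
        fh.write('textpipe: clean and extract metadata from text\n')
    print(read_readme(tmp))
```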

Add logging

In order to expose some of the pipeline processing, we need to add logging in Doc and Pipeline classes.

Improve documentation

Currently the documentation is a bit bare-bones. I think it would help in particular to explain the fundamental concepts such as the pipeline, operations, steps, doc properties, and how they interrelate.

Add to awesome-nlp?

Hey, I'd like to have textpipe over at awesome-nlp. The repo seems like some really good work.

If you'd be kind enough to add a few more usage examples/tutorials, it'd get awesome enough for us to add it there :)

Can you please raise a PR when you are ready?

Add Pipeline serialization

To load and save pipelines (for, e.g., having the same pipeline for pre-processing docs for training and predicting), we need to be able to serialize pipelines.
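
One way this could look, assuming the pipeline definition is a list of (name, parameters) tuples as proposed in the issue on implementing the pipeline as a list of tuples, and assuming every parameter value is JSON-serializable. `dump_steps`, `load_steps`, and the `min_length` parameter are hypothetical names; custom operations registered as Python callables would still need to be re-registered after loading, since only their names are stored.

```python
import json


def dump_steps(steps):
    """Serialize a list of (name, params) steps to a JSON string."""
    return json.dumps([[name, params] for name, params in steps])


def load_steps(payload):
    """Rebuild the list of (name, params) tuples from a JSON string."""
    return [(name, params) for name, params in json.loads(payload)]


steps = [("CleanText", {}), ("NWords", {"min_length": 2})]
restored = load_steps(dump_steps(steps))
print(restored == steps)  # True
```

Storing only step names and parameters keeps the saved file readable and lets the same definition drive both training-time and prediction-time preprocessing.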
