davidmogar / cucco
Text normalization library for Python.
License: MIT License
Having a way to define which normalizations to apply would be handy once the CLI is in place (#33). A YAML file is probably the best option for an easy-to-read, easy-to-define config file.
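A minimal sketch of what such a file could look like (the keys and structure here are assumptions for discussion, not a decided format):

```yaml
# Hypothetical cucco configuration sketch; names are illustrative only.
language: en
normalizations:
  - remove_extra_whitespaces
  - remove_stop_words
  - replace_punctuation:
      replacement: ''
```

Entries with options would map naturally onto the (name, kwargs) tuples the library already accepts.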
I installed the library via pip, but when I try to use it within a script I get the following error:
ImportError: No module named regex
Following the traceback, the error apparently occurs when importing normalizr.regex as regex.
I want to have a stable, tested, and functional master branch, so new branches for development and QA are needed.
It is my intention to have great test coverage for cucco, so I'm setting my goal at 100% test coverage. To achieve this, tests still have to be added for batch and the new CLI.
I started using reStructuredText because PyPI doesn't really support Markdown. I can't stand it anymore!
Move the readme to Markdown and improve it.
Right now, stop words are removed only for the language specified when instantiating cucco. For the library to be used in an API, it needs to be possible to specify the language when calling the stop words function. All stop word files also need to be kept in memory.
As pointed out in #17, at some point a bug was introduced:
Python 2.7.12 (default, Jul 1 2016, 15:12:24)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from normalizr import Normalizr
>>> normalizr = Normalizr(language='en')
>>> print(normalizr.normalize(u'Who let the dog out?'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "normalizr/normalizr.py", line 85, in normalize
for normalization, kwargs in self._parse_normalizations(normalizations or DEFAULT_NORMALIZATIONS):
ValueError: too many values to unpack
This seems to be related to string types (str vs. unicode).
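The unpacking fails when a bare normalization name (a string) reaches the two-variable tuple unpacking. A defensive parsing sketch (Python 3; the helper name mirrors the traceback but the body is an assumption, not cucco's actual code):

```python
def parse_normalizations(normalizations):
    """Yield (name, kwargs) pairs from a mixed list of names and tuples."""
    for entry in normalizations:
        if isinstance(entry, str):
            # A bare string is a normalization with no options; unpacking
            # it directly would iterate its characters and raise
            # "too many values to unpack".
            yield entry, {}
        else:
            name, kwargs = entry
            yield name, kwargs

print(list(parse_normalizations(['remove_stop_words',
                                 ('replace_punctuation', {'replacement': ''})])))
```

On Python 2 the isinstance check would also need to cover unicode, which is likely where the regression crept in.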
Another library with this name is so active that it makes it difficult to find Normalizr. It would be nice to have a new name. Thinking about it...
Currently it is possible to normalize all files in a given path, but there is no way to normalize a single file. This functionality should be added.
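A sketch of what a single-file entry point could look like (the function name and signature are assumptions; the normalize callable stands in for a bound Cucco.normalize):

```python
def normalize_file(path, normalize):
    # Hypothetical helper: apply a normalization callable line by line
    # and return the list of normalized lines.
    with open(path, encoding='utf-8') as f:
        return [normalize(line.rstrip('\n')) for line in f]
```

The existing path-walking code could then reuse this helper for each file it finds.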
The test suite works only with Python >= 3. Previous versions should be supported too.
Also, there is an error in the question_mark test. The expected text after normalization should be fixed.
/cc @feinsteinben
Executing the following code freezes cucco completely
cucco.normalize('[http://example.com/test.jpg](http://example.com/test.jpg)', ['replace_urls'])
Leaving out just one ( or the .jpg solves this. Very strange...
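A hang that disappears when one character is removed smells like catastrophic regex backtracking on the nested URL. For comparison, a URL pattern without nested quantifiers cannot backtrack this way (this pattern is an assumption for illustration, not cucco's actual regex):

```python
import re

# A deliberately simple URL pattern: a scheme followed by non-whitespace
# characters. With a single unnested quantifier there is no exponential
# backtracking, so pathological inputs cannot freeze the matcher.
URL_RE = re.compile(r'https?://\S+')

def replace_urls(text, replacement=''):
    return URL_RE.sub(replacement, text)

print(replace_urls('[http://example.com/test.jpg](http://example.com/test.jpg)'))
```

The trade-off is precision: such a pattern also swallows trailing brackets, so the real fix probably needs a more careful grammar, just one without nested repetition.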
Can you either completely remove the print('loading') statement on line 42 of normalizr.py, or have it suppressed by default with an option to turn it on if needed?
I'm applying the function to a pandas DataFrame via .apply, and it gets really noisy having 'loading' printed unnecessarily a whole bunch of times.
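One option is to route the message through the logging module, which is silent below WARNING by default; a sketch (the function body is illustrative, not cucco's actual code):

```python
import logging

logger = logging.getLogger('cucco')

def load_stop_words(language):
    # Debug-level instead of print: hidden unless the caller opts in,
    # e.g. with logging.basicConfig(level=logging.DEBUG).
    logger.debug('loading stop words for language %s', language)
    return set()  # placeholder for the actual file loading
```

This keeps the diagnostic available without spamming callers who apply the function thousands of times.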
This should be really easy to implement, and it would help extend the library and make it more usable.
These days, social networks (e.g., Twitter) are a crucial source of text. Social-specific normalizations (e.g., replace_hashtags, replace_mentions) would be quite useful for these tasks.
I'd like to add these capabilities to Normalizr.
Cheers,
Ben
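The proposed normalizations above could be sketched like this (these implementations are a proposal, not part of the current API; the patterns are intentionally simple):

```python
import re

def replace_hashtags(text, replacement=''):
    # Replace Twitter-style hashtags such as #nlp.
    return re.sub(r'#\w+', replacement, text)

def replace_mentions(text, replacement=''):
    # Replace Twitter-style mentions such as @cucco.
    return re.sub(r'@\w+', replacement, text)

print(replace_mentions(replace_hashtags('@cucco loves #nlp')))
```

Both would fit cucco's existing pattern of taking an optional replacement kwarg.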
Thanks for this amazing library!
It seems that replace_punctuation is missing these 2 characters: “ ”
Hi guys,
Let's say that I want to normalise this string:
"Protein Recommendations for Bodybuilders: In This Case, More May Indeed Be Better."
Without any extra Cucco setup (default normalizations) I received:
"Protein Recommendations Bodybuilders Case".
With extra Cucco setup:
normalizations = [
'remove_extra_whitespaces',
'remove_accent_marks',
'remove_stop_words',
('replace_hyphens', {'replacement': ''}),
('replace_punctuation', {'replacement': ''}),
('replace_symbols', {'replacement': ''}),
]
I received:
"Protein Recommendations Bodybuilders Case Better"
My question is: where is the rest of the string?
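The missing words appear to be exactly those on the English stop word list, which remove_stop_words drops. A minimal sketch of the effect (the word set here is an assumption for illustration, not cucco's actual list):

```python
# Common English stop words (illustrative subset).
STOP_WORDS = {'for', 'in', 'this', 'more', 'may', 'indeed', 'be'}

def remove_stop_words(text):
    # Drop any token found in the stop word set, case-insensitively.
    return ' '.join(w for w in text.split() if w.lower() not in STOP_WORDS)

print(remove_stop_words(
    'Protein Recommendations for Bodybuilders In This Case More May Indeed Be Better'))
```

So the "missing" part of the string is removed by design; skipping remove_stop_words in the normalizations list would keep it.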
If the library is used for text normalization, stemming is a must.
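To illustrate the kind of API a stemming normalization could expose, here is a very rough suffix-stripping sketch (this is not a real stemmer such as Porter's, and a proper implementation would likely wrap an existing one):

```python
def stem_word(word):
    # Naive suffix stripping, checked longest-first; the length guard
    # avoids mangling short words.
    for suffix in ('ing', 'ers', 'er', 'ed', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

print([stem_word(w) for w in ['normalizing', 'bodybuilders', 'tested']])
```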
The setup.py script tries to read the old readme file.
This is fixed in PR #22.
w = 'Car , 950'
cucco.normalize(w)
The program seems to remove extra whitespace before removing punctuation. This causes it to return 'Car__950' rather than 'Car_950'.
ETA: underscores shown in place of spaces to make the effect visible.
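Doing the steps in the other order avoids the doubled space; a sketch (the regexes are illustrative, not cucco's actual ones):

```python
import re

def replace_punctuation_then_collapse(text, replacement=' '):
    # Replace punctuation first, so the whitespace it leaves behind...
    text = re.sub(r'[^\w\s]', replacement, text)
    # ...is still there to be collapsed in the whitespace pass.
    return re.sub(r'\s+', ' ', text).strip()

print(replace_punctuation_then_collapse('Car , 950'))
```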
The newer emojis (those in the Supplemental Symbols and Pictographs block) fail to be removed.
Check this gist https://gist.github.com/octohedron/3823d081eb1b92abe93b570875ec77f4
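A sketch of the kind of fix needed: extend the character class to cover the newer Supplemental Symbols and Pictographs block (U+1F900–U+1F9FF) alongside the older emoji ranges (the ranges here are assumptions, not cucco's actual pattern):

```python
import re

# Older emoji ranges plus the newer U+1F900-U+1F9FF block.
EMOJI_RE = re.compile('[\U0001F300-\U0001F5FF'
                      '\U0001F600-\U0001F64F'
                      '\U0001F680-\U0001F6FF'
                      '\U0001F900-\U0001F9FF]')

def remove_emojis(text):
    return EMOJI_RE.sub('', text)

print(remove_emojis('thinking \U0001F914 hard'))
```

New blocks keep being added to Unicode, so matching the broader emoji property (as the third-party regex module can) may be more future-proof than hard-coded ranges.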
The current package has become too old and should be updated. It also doesn't work with Python 2.
This is also a good moment to change the name (issue #12).
Hi, I tried the code in the README, but I got this error. My Python is 2.7.
Traceback (most recent call last):
File "remove-accent.py", line 5, in <module>
from normalizr import Normalizr
File "/usr/local/lib/python2.7/site-packages/normalizr/__init__.py", line 3, in <module>
from normalizr.normalizr import Normalizr
File "/usr/local/lib/python2.7/site-packages/normalizr/normalizr.py", line 7, in <module>
import normalizr.regex as regex
ImportError: No module named regex
There isn't much documentation on how to use the stop-words list. Would it make sense to add the capability to use a custom stop-word list rather than having to modify an existing one? Or does that capability already exist?
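If it doesn't exist yet, loading a custom list could be as simple as this sketch (the function name and one-word-per-line file format are assumptions, not current cucco behavior):

```python
import os
import tempfile

def load_custom_stop_words(path):
    """Read one stop word per line, ignoring blank lines."""
    with open(path, encoding='utf-8') as f:
        return {line.strip().lower() for line in f if line.strip()}

# Quick demonstration with a throwaway file:
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write('the\nAnd\n\nof\n')
    path = f.name
print(sorted(load_custom_stop_words(path)))
os.unlink(path)
```

The resulting set could then be passed to (or swapped into) the stop-word normalization instead of the bundled per-language file.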
Now that there is a pull request adding some tests (#10), it would be nice to integrate the project with Travis. On hold until the request gets merged.