davidmogar / cucco
Text normalization library for Python.
License: MIT License
Having a way to define which normalizations to apply would be handy once the CLI is in place (#33). A YAML file is probably the best option for an easy-to-read, easy-to-define config file.
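A minimal sketch of what such a file could look like (the keys and structure here are assumptions for discussion, not a decided format):

```yaml
# Hypothetical cucco configuration sketch; names are illustrative only.
language: en
normalizations:
  - remove_extra_whitespaces
  - remove_stop_words
  - replace_punctuation:
      replacement: ''
```

Entries with options would map naturally onto the (name, kwargs) tuples the library already accepts.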
I installed the library via pip, but when I try to use it within a script I get the following error:
ImportError: No module named regex
Following the traceback, the error apparently occurs when importing normalizr.regex as regex.
I want to have a stable, tested, and functional master branch, so new branches for development and QA are needed.
It is my intention to have great test coverage for cucco, so I'm setting my goal at 100% test coverage. To achieve this, tests still have to be added for batch and the new CLI.
I started using reStructuredText because PyPI doesn't really support Markdown. I can't stand it anymore!
Move the readme to Markdown and improve it.
Right now, stop words are removed only for the language specified when instantiating cucco. For the library to be used in an API, it needs to be possible to specify the language when calling the stop words function. All stop word files also need to be kept in memory.
As pointed out in #17, at some point a bug was introduced:
Python 2.7.12 (default, Jul 1 2016, 15:12:24)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from normalizr import Normalizr
>>> normalizr = Normalizr(language='en')
>>> print(normalizr.normalize(u'Who let the dog out?'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "normalizr/normalizr.py", line 85, in normalize
for normalization, kwargs in self._parse_normalizations(normalizations or DEFAULT_NORMALIZATIONS):
ValueError: too many values to unpack
This seems to be related to string types (str vs. unicode).
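The unpacking fails when a bare normalization name (a string) reaches the two-variable tuple unpacking. A defensive parsing sketch (Python 3; the helper name mirrors the traceback but the body is an assumption, not cucco's actual code):

```python
def parse_normalizations(normalizations):
    """Yield (name, kwargs) pairs from a mixed list of names and tuples."""
    for entry in normalizations:
        if isinstance(entry, str):
            # A bare string is a normalization with no options; unpacking
            # it directly would iterate its characters and raise
            # "too many values to unpack".
            yield entry, {}
        else:
            name, kwargs = entry
            yield name, kwargs

print(list(parse_normalizations(['remove_stop_words',
                                 ('replace_punctuation', {'replacement': ''})])))
```

On Python 2 the isinstance check would also need to cover unicode, which is likely where the regression crept in.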
Another library with this name is so active that it makes it difficult to find Normalizr. It would be nice to have a new name. Thinking about it...
Currently it is possible to normalize all files in a given path, but there is no way to normalize a single file. This functionality should be added.
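A sketch of what a single-file entry point could look like (the function name and signature are assumptions; the normalize callable stands in for a bound Cucco.normalize):

```python
def normalize_file(path, normalize):
    # Hypothetical helper: apply a normalization callable line by line
    # and return the list of normalized lines.
    with open(path, encoding='utf-8') as f:
        return [normalize(line.rstrip('\n')) for line in f]
```

The existing path-walking code could then reuse this helper for each file it finds.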
The test suite works only with Python >= 3. Previous versions should be supported too.
Also, there is an error in the question_mark test. The expected text after normalization should be fixed.
/cc @feinsteinben
Executing the following code freezes cucco completely
cucco.normalize('[http://example.com/test.jpg](http://example.com/test.jpg)', ['replace_urls'])
Leaving out just one ( or the .jpg solves this. Very strange...
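A hang that disappears when one character is removed smells like catastrophic regex backtracking on the nested URL. For comparison, a URL pattern without nested quantifiers cannot backtrack this way (this pattern is an assumption for illustration, not cucco's actual regex):

```python
import re

# A deliberately simple URL pattern: a scheme followed by non-whitespace
# characters. With a single unnested quantifier there is no exponential
# backtracking, so pathological inputs cannot freeze the matcher.
URL_RE = re.compile(r'https?://\S+')

def replace_urls(text, replacement=''):
    return URL_RE.sub(replacement, text)

print(replace_urls('[http://example.com/test.jpg](http://example.com/test.jpg)'))
```

The trade-off is precision: such a pattern also swallows trailing brackets, so the real fix probably needs a more careful grammar, just one without nested repetition.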
Can you either completely remove the print('loading') statement on line 42 of normalizr.py, or have it suppressed by default with an option to turn it on if needed?
I'm applying the function to a pandas DataFrame via .apply, and it gets really noisy having 'loading' printed unnecessarily a whole bunch of times.
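One option is to route the message through the logging module, which is silent below WARNING by default; a sketch (the function body is illustrative, not cucco's actual code):

```python
import logging

logger = logging.getLogger('cucco')

def load_stop_words(language):
    # Debug-level instead of print: hidden unless the caller opts in,
    # e.g. with logging.basicConfig(level=logging.DEBUG).
    logger.debug('loading stop words for language %s', language)
    return set()  # placeholder for the actual file loading
```

This keeps the diagnostic available without spamming callers who apply the function thousands of times.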
This should be really easy to implement, and it would help extend the library and make it more usable.
These days, social networks (e.g., Twitter) are a crucial source of text. Social-specific normalizations (e.g., replace_hashtags, replace_mentions) would be quite useful for these tasks.
I'd like to add these capabilities to Normalizr.
Cheers,
Ben
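The proposed normalizations above could be sketched like this (these implementations are a proposal, not part of the current API; the patterns are intentionally simple):

```python
import re

def replace_hashtags(text, replacement=''):
    # Replace Twitter-style hashtags such as #nlp.
    return re.sub(r'#\w+', replacement, text)

def replace_mentions(text, replacement=''):
    # Replace Twitter-style mentions such as @cucco.
    return re.sub(r'@\w+', replacement, text)

print(replace_mentions(replace_hashtags('@cucco loves #nlp')))
```

Both would fit cucco's existing pattern of taking an optional replacement kwarg.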
Thanks for this amazing library!
It seems that replace_punctuation is missing these 2 characters: “ ”
Hi guys,
Let's say that I want to normalise this string:
"Protein Recommendations for Bodybuilders: In This Case, More May Indeed Be Better."
Without any extra Cucco setup (default normalizations) I received:
"Protein Recommendations Bodybuilders Case".
With extra Cucco setup:
normalizations = [
'remove_extra_whitespaces',
'remove_accent_marks',
'remove_stop_words',
('replace_hyphens', {'replacement': ''}),
('replace_punctuation', {'replacement': ''}),
('replace_symbols', {'replacement': ''}),
]
I received:
"Protein Recommendations Bodybuilders Case Better"
My question is: where is the rest of the string?
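The missing words appear to be exactly those on the English stop word list, which remove_stop_words drops. A minimal sketch of the effect (the word set here is an assumption for illustration, not cucco's actual list):

```python
# Common English stop words (illustrative subset).
STOP_WORDS = {'for', 'in', 'this', 'more', 'may', 'indeed', 'be'}

def remove_stop_words(text):
    # Drop any token found in the stop word set, case-insensitively.
    return ' '.join(w for w in text.split() if w.lower() not in STOP_WORDS)

print(remove_stop_words(
    'Protein Recommendations for Bodybuilders In This Case More May Indeed Be Better'))
```

So the "missing" part of the string is removed by design; skipping remove_stop_words in the normalizations list would keep it.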
If the library is used for text normalization, stemming is a must.
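To illustrate the kind of API a stemming normalization could expose, here is a very rough suffix-stripping sketch (this is not a real stemmer such as Porter's, and a proper implementation would likely wrap an existing one):

```python
def stem_word(word):
    # Naive suffix stripping, checked longest-first; the length guard
    # avoids mangling short words.
    for suffix in ('ing', 'ers', 'er', 'ed', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

print([stem_word(w) for w in ['normalizing', 'bodybuilders', 'tested']])
```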
The setup.py script tries to read the old readme file.
This is fixed in PR #22.
w = 'Car , 950'
cucco.normalize(w)
The program seems to remove extra whitespace before removing punctuation. This causes it to return 'Car__950' rather than 'Car_950'.
ETA: underscores shown in place of spaces to make the effect visible.
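Doing the steps in the other order avoids the doubled space; a sketch (the regexes are illustrative, not cucco's actual ones):

```python
import re

def replace_punctuation_then_collapse(text, replacement=' '):
    # Replace punctuation first, so the whitespace it leaves behind...
    text = re.sub(r'[^\w\s]', replacement, text)
    # ...is still there to be collapsed in the whitespace pass.
    return re.sub(r'\s+', ' ', text).strip()

print(replace_punctuation_then_collapse('Car , 950'))
```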
The newer emojis (those in the Supplemental Symbols and Pictographs block) fail to be removed.
Check this gist https://gist.github.com/octohedron/3823d081eb1b92abe93b570875ec77f4
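A sketch of the kind of fix needed: extend the character class to cover the newer Supplemental Symbols and Pictographs block (U+1F900–U+1F9FF) alongside the older emoji ranges (the ranges here are assumptions, not cucco's actual pattern):

```python
import re

# Older emoji ranges plus the newer U+1F900-U+1F9FF block.
EMOJI_RE = re.compile('[\U0001F300-\U0001F5FF'
                      '\U0001F600-\U0001F64F'
                      '\U0001F680-\U0001F6FF'
                      '\U0001F900-\U0001F9FF]')

def remove_emojis(text):
    return EMOJI_RE.sub('', text)

print(remove_emojis('thinking \U0001F914 hard'))
```

New blocks keep being added to Unicode, so matching the broader emoji property (as the third-party regex module can) may be more future-proof than hard-coded ranges.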
The current package has become too old and should be updated. It also doesn't work with Python 2.
This is also a good moment to change the name (issue #12).
Hi, I tried the code in the README, but I got this error. My Python is 2.7.
Traceback (most recent call last):
File "remove-accent.py", line 5, in <module>
from normalizr import Normalizr
File "/usr/local/lib/python2.7/site-packages/normalizr/__init__.py", line 3, in <module>
from normalizr.normalizr import Normalizr
File "/usr/local/lib/python2.7/site-packages/normalizr/normalizr.py", line 7, in <module>
import normalizr.regex as regex
ImportError: No module named regex
There isn't much documentation on how to use the stop-words list. Would it make sense to add the capability to use a custom stop-word list rather than having to modify an existing one? Or does that capability already exist?
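If it doesn't exist yet, loading a custom list could be as simple as this sketch (the function name and one-word-per-line file format are assumptions, not current cucco behavior):

```python
import os
import tempfile

def load_custom_stop_words(path):
    """Read one stop word per line, ignoring blank lines."""
    with open(path, encoding='utf-8') as f:
        return {line.strip().lower() for line in f if line.strip()}

# Quick demonstration with a throwaway file:
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write('the\nAnd\n\nof\n')
    path = f.name
print(sorted(load_custom_stop_words(path)))
os.unlink(path)
```

The resulting set could then be passed to (or swapped into) the stop-word normalization instead of the bundled per-language file.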
Now that there is a pull request adding some tests (#10), it would be nice to integrate the project with Travis. On hold until the request gets merged.