
cucco's People

Contributors

armaggedon, davidmogar, efueger, franck-dernoncourt, luzfcb, xecgr

cucco's Issues

Support for config files

Having a way to define which normalizations to apply would be handy once the CLI is in place (#33). A YAML file is probably the best option, since it makes for a config file that is easy to read and write.

ImportError: No module named regex

I installed the library via pip but when I try to use it within a script I'm getting the following error:

ImportError: No module named regex

Following the traceback, the error apparently occurs when importing normalizr.regex as regex.

Create a develop branch

I want to have a stable, tested and functional branch (master) so a new branch for developing and for QA is needed.

100% test coverage

It is my intention to have great test coverage for cucco, so I'm setting my goal at 100% test coverage. To achieve this, tests still have to be added for batch and the new CLI.

Move to Markdown

I started using reStructuredText as PyPI doesn't really like Markdown. Can't stand it anymore!

Move Readme to Markdown and improve it.

Prepare library to be used by an API

Right now, stop words are removed only for the language specified when instantiating Cucco. For the library to be used in an API, it must be possible to specify the language when calling the stop words function. All stop words files also need to be kept in memory.
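A minimal sketch of the idea, not the library's actual code: the stop-word lists for every language stay in memory and the language is chosen per call. The names `STOP_WORDS` and `remove_stop_words` and the tiny word lists are assumptions made up for illustration.

```python
# Hypothetical in-memory stop-word lists, keyed by language code. A real
# service would load one file per language at start-up instead.
STOP_WORDS = {
    "en": {"the", "a", "an", "for", "in", "this"},
    "es": {"el", "la", "los", "las", "de", "en"},
}


def remove_stop_words(text, language="en", stop_words=STOP_WORDS):
    """Remove stop words for `language`, chosen per call rather than per instance."""
    words_to_drop = stop_words.get(language, set())
    kept = [word for word in text.split() if word.lower() not in words_to_drop]
    return " ".join(kept)


if __name__ == "__main__":
    print(remove_stop_words("The dog in the garden", language="en"))
    print(remove_stop_words("El perro de la casa", language="es"))
```

Because the lists are loaded once and passed by reference, a single process can serve requests in several languages without re-reading files.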

Normalizer unpacking not working on Python 2

As pointed out in #17, a bug was introduced at some point:

Python 2.7.12 (default, Jul  1 2016, 15:12:24)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from normalizr import Normalizr
>>> normalizr = Normalizr(language='en')
>>> print(normalizr.normalize(u'Who let the dog out?'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "normalizr/normalizr.py", line 85, in normalize
    for normalization, kwargs in self._parse_normalizations(normalizations or DEFAULT_NORMALIZATIONS):
ValueError: too many values to unpack

This seems to be related to string types (str and unicode).
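The following sketch shows one way such a failure can arise and be avoided. It is an assumption, since the parser itself is not shown in this thread: if a bare normalization name is only recognized with a check against `str`, a Python 2 `unicode` name would fall through to the tuple branch and fail to unpack. The function `parse_normalizations` below is a hypothetical stand-in, not the library's code.

```python
try:
    string_types = (basestring,)  # Python 2: covers both str and unicode
except NameError:
    string_types = (str,)  # Python 3: str is the only text type


def parse_normalizations(normalizations):
    """Yield (name, kwargs) pairs from a mix of names and (name, kwargs) tuples."""
    for normalization in normalizations:
        if isinstance(normalization, string_types):
            # A bare name carries no keyword arguments.
            yield normalization, {}
        else:
            name, kwargs = normalization
            yield name, kwargs


if __name__ == "__main__":
    config = ["remove_stop_words", ("replace_hyphens", {"replacement": ""})]
    for name, kwargs in parse_normalizations(config):
        print(name, kwargs)
```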

Allow to normalize a single file

Currently it is possible to normalize the files in a given path, but there is no way to normalize a single file. This functionality should be added.
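One possible shape for this, sketched under the assumption that batch processing works from a list of file paths: a helper that accepts either a single file or a directory. The name `iter_files` is hypothetical and not part of the library.

```python
from pathlib import Path


def iter_files(path):
    """Yield the files to normalize for a path that is a file or a directory."""
    path = Path(path)
    if path.is_file():
        # Single-file case: the path itself is the only item.
        yield path
    elif path.is_dir():
        # Directory case: yield regular files in a stable order.
        for child in sorted(path.iterdir()):
            if child.is_file():
                yield child
    else:
        raise FileNotFoundError(str(path))
```

The caller can then treat both cases the same way, for example `for file_path in iter_files(target): ...`.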

"loading" text

Can you either completely remove the print('loading') statement in line 42 of normalizr.py or have it suppressed by default with option to turn it on if needed?

I'm applying the function to a pandas dataframe via the .apply function, and it gets really noisy having 'loading' printed out unnecessarily a whole bunch of times.
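A common way to get "silent by default, available on request" is to swap the print call for the standard `logging` module. The sketch below assumes the message marks a resource being loaded; the logger name and the `load_stop_words` function are hypothetical.

```python
import logging

# Hypothetical logger name. The NullHandler keeps the library quiet unless
# the application configures logging itself.
logger = logging.getLogger("normalizr_example")
logger.addHandler(logging.NullHandler())


def load_stop_words(language):
    """Pretend to load stop words, logging at DEBUG level instead of printing."""
    logger.debug("loading stop words for %s", language)
    return ["the", "a"] if language == "en" else []


if __name__ == "__main__":
    load_stop_words("en")  # prints nothing by default
    logging.basicConfig(level=logging.DEBUG)
    load_stop_words("en")  # now the debug message is shown
```

With this approach, calling the function thousands of times through `.apply` stays quiet unless the caller explicitly enables debug logging.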

Command line tool

This should be really easy to implement; it would help extend the library and make it more usable.

Add social-specific normalizations

These days, social networks (e.g., Twitter) are a crucial source of texts. Social-specific normalizations (e.g., replace_hashtags, replace_mentions) are quite useful for these tasks.

I'd like to add these capabilities to Normalizr.

Cheers,
Ben
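As a rough illustration of what the two proposed normalizations could look like, here is a regex-based sketch. The function names come from the issue, but the signatures and patterns are assumptions; the patterns are deliberately simple and would also match the `@domain` part of an e-mail address.

```python
import re

# Simplified patterns: '#' or '@' followed by word characters.
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")


def replace_hashtags(text, replacement=""):
    """Replace every hashtag in `text` with `replacement`."""
    return HASHTAG_RE.sub(replacement, text)


def replace_mentions(text, replacement=""):
    """Replace every @mention in `text` with `replacement`."""
    return MENTION_RE.sub(replacement, text)


if __name__ == "__main__":
    print(replace_hashtags("Loving #python today", "<hashtag>"))
    print(replace_mentions("thanks @ben!", "<user>"))
```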

Incomplete normalization

Hi guys,

Let's say that I wanna normalise this string:

"Protein Recommendations for Bodybuilders: In This Case, More May Indeed Be Better."

Without extra Cucco setup (normalizations) I received:

"Protein Recommendations Bodybuilders Case".

With extra Cucco setup:

normalizations = [
	'remove_extra_whitespaces',
	'remove_accent_marks',
	'remove_stop_words',
	('replace_hyphens', {'replacement': ''}),
	('replace_punctuation', {'replacement': ''}),
	('replace_symbols', {'replacement': ''}),
]

I received:

"Protein Recommendations Bodybuilders Case Better"

My question is: where is the rest of the string?

Broken setup.py

The setup.py script tries to read the old readme file.
It's fixed in PR #22.

Order of operations

w = 'Car , 950'
cucco.normalize(w)

The program seems to check for whitespace to remove before removing punctuation. This causes it to return 'Car__950' rather than 'Car_950'.

ETA: added underscore in place of spaces to show effect.
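The effect can be reproduced outside the library with two stand-in functions. This is a sketch, not cucco's implementation: `replace_punctuation` and `remove_extra_whitespaces` below are simplified re-implementations written only to show why the order matters.

```python
import re
import string

PUNCTUATION_RE = re.compile("[" + re.escape(string.punctuation) + "]")
WHITESPACE_RE = re.compile(r"\s+")


def replace_punctuation(text, replacement=""):
    """Replace ASCII punctuation characters with `replacement`."""
    return PUNCTUATION_RE.sub(replacement, text)


def remove_extra_whitespaces(text):
    """Collapse runs of whitespace into one space and trim the ends."""
    return WHITESPACE_RE.sub(" ", text).strip()


w = "Car , 950"

# Whitespace first: the comma is removed afterwards, leaving two spaces.
whitespace_first = replace_punctuation(remove_extra_whitespaces(w))

# Punctuation first: the gap left by the comma is collapsed afterwards.
punctuation_first = remove_extra_whitespaces(replace_punctuation(w))

if __name__ == "__main__":
    print(repr(whitespace_first))
    print(repr(punctuation_first))
```

Running whitespace removal last (or a second time at the end) is one way to avoid the doubled space.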

Update PyPI package

The current package has become too old and should be updated. It also doesn't work with Python 2.

This is also a good moment to change the name (issue #12).

Got an error from sample code.

Hi, I tried the code in the README, but I got this error. My Python version is 2.7.

Traceback (most recent call last):
  File "remove-accent.py", line 5, in <module>
    from normalizr import Normalizr
  File "/usr/local/lib/python2.7/site-packages/normalizr/__init__.py", line 3, in <module>
    from normalizr.normalizr import Normalizr
  File "/usr/local/lib/python2.7/site-packages/normalizr/normalizr.py", line 7, in <module>
    import normalizr.regex as regex
ImportError: No module named regex

Stop Words suggestion

There isn't much documentation on how to use the stop-words list. Would it make sense to add the ability to use a custom stop-word list rather than having to modify an existing one? Or does that capability already exist?
