letuananh / chirptext

ChirpText is a collection of text processing tools for Python.

Home Page: https://chirptext.readthedocs.io

License: MIT License

Python 99.85% Shell 0.15%
python nlp linguistics japanese chinese vietnamese mecab

chirptext's Introduction

ChirpText is a collection of text processing tools for Python 3.

It is not meant to be a powerful tank like the popular NLTK, but a small package that you can pip-install anywhere to process textual data in a few lines of code.

Main features

  • Simple file data manipulation using an enhanced open() function (txt, gz, binary, etc.)
  • CSV helper functions
  • Parse Japanese text with the mecab library (does not require the mecab-python3 package, even on Windows; only a binary release such as mecab.exe is required)
  • Built-in "lite" text annotation formats (texttaglib TTL/CSV and TTL/JSON)
  • Helper functions and useful data for processing English, Japanese, Chinese and Vietnamese.
  • Application configuration file management that can make an educated guess about where config files are located
  • Quick text-based report generation
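
To illustrate the idea behind an "enhanced open()" that handles both plain and gzip-compressed files, here is a minimal sketch built only on the Python standard library. It is not chirptext's implementation; the helper name `smart_open` and its parameters are hypothetical and chosen for illustration only.

```python
import gzip
from pathlib import Path


def smart_open(path, mode="rt", encoding="utf-8"):
    """Hypothetical helper (not part of chirptext): pick an opener by file extension."""
    path = Path(path)
    binary = "b" in mode
    # gzip.open treats a bare "r"/"w" as binary, so make text mode explicit
    if not binary and "t" not in mode:
        mode += "t"
    if path.suffix == ".gz":
        # Compressed file: let gzip handle (de)compression transparently
        return gzip.open(path, mode, encoding=None if binary else encoding)
    if binary:
        return open(path, mode)
    return open(path, mode, encoding=encoding)
```

With a helper like this, `smart_open('notes.txt.gz', 'w')` and `smart_open('notes.txt', 'w')` can be used the same way by calling code, which is the convenience the feature list describes.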

Installation

chirptext is available on PyPI and can be installed using pip:

pip install chirptext

Parsing Japanese text

chirptext supports parsing Japanese text using different parsers (mecab, Janome, and igo-python)

>>> from chirptext import deko
>>> sent = deko.parse('猫が好きです。')
>>> sent.tokens
['`猫`<0:1>', '`が`<1:2>', '`好き`<2:4>', '`です`<4:6>', '`。`<6:7>']
>>> sent.tokens.values()
['猫', 'が', '好き', 'です', '。']
>>> sent[0]
`猫`<0:1>
>>> sent[0].pos
'名詞'
>>> sent[1].lemma
'が'
>>> sent[2].reading
'スキ'

# tokenize
>>> deko.tokenize('猫が好きです。')
['猫', 'が', '好き', 'です', '。']

# split sentences
>>> deko.tokenize_sent("猫が好きです。\n犬も好きです。")
['猫が好きです。', '犬も好きです。']

# parse a document (i.e. multiple sentences)
>>> doc = deko.parse_doc("猫が好きです。\n犬も好きです。")
>>> for sent in doc:
...     print(sent, sent.tokens.values())
... 
猫が好きです。 ['猫', 'が', '好き', 'です', '。']
犬も好きです。 ['犬', 'も', '好き', 'です', '。']

Note: At least one of the following tools must be installed to use chirptext's Japanese parsing:

  1. mecab: http://taku910.github.io/mecab/#download
  2. Janome: available on PyPI, install with pip install Janome
  3. igo-python: available on PyPI, install with pip install igo-python
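
Before calling the parsing functions, it can help to check which of the tools listed above are actually present. The sketch below uses only the standard library and is not part of chirptext. The function name `available_parsers` is hypothetical, and the executable name `mecab` and the import names `janome` and `igo` are assumptions about how these tools are installed on your system.

```python
import importlib.util
import shutil


def available_parsers(mecab_binary="mecab"):
    """Hypothetical helper: list which Japanese parsing backends can be found."""
    found = []
    # mecab is an external program, so look for it on the PATH
    if shutil.which(mecab_binary):
        found.append("mecab")
    # Janome and igo-python are Python packages (assumed import names)
    if importlib.util.find_spec("janome") is not None:
        found.append("janome")
    if importlib.util.find_spec("igo") is not None:
        found.append("igo-python")
    return found


print(available_parsers())
```

An empty list suggests that none of the tools can be located, which is one plausible cause of a FileNotFoundError when a mecab process is launched.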

Convenient IO APIs

>>> from chirptext import chio
>>> chio.write_tsv('data/test.tsv', [['a', 'b'], ['c', 'd']])
>>> chio.read_tsv('data/test.tsv')
[['a', 'b'], ['c', 'd']]

>>> chio.write_file('data/content.tar.gz', 'Support writing to .tar.gz file')
>>> chio.read_file('data/content.tar.gz')
'Support writing to .tar.gz file'

>>> for row in chio.read_tsv_iter('data/test.tsv'):
...     print(row)
... 
['a', 'b']
['c', 'd']

Sample TextReport

# Assumed import (not shown in the original snippet)
from chirptext import TextReport

# a string report
rp = TextReport()  # by default, TextReport will write to standard output, i.e. terminal
rp = TextReport(TextReport.STDOUT)  # same as above
rp = TextReport('~/tmp/my-report.txt')  # output to a file
rp = TextReport.null()  # output to /dev/null, i.e. nowhere
rp = TextReport.string()  # output to a string. Call rp.content() to get the string
rp = TextReport(TextReport.STRINGIO)  # same as above

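# LOREM_IPSUM (the sample text) and ct (a counter of its letters) are assumed to be defined beforehand
# TextReport will close the output stream automatically when used in a with statement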
with TextReport.string() as rp:
    rp.header("Lorem Ipsum Analysis", level="h0")
    rp.header("Raw", level="h1")
    rp.print(LOREM_IPSUM)
    rp.header("Top 5 most common letters")
    ct.summarise(report=rp, limit=5)
    print(rp.content())

Output

+---------------------------------------------------------------------------------- 
| Lorem Ipsum Analysis 
+---------------------------------------------------------------------------------- 
 
Raw 
------------------------------------------------------------ 
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. 
 
Top 5 most common letters
------------------------------------------------------------ 
i: 42 
e: 37 
t: 32 
o: 29 
a: 29 

chirptext's People

Contributors

letuananh, saihtaungkham

chirptext's Issues

Asking for a new release on PyPI

Hi,

Version 0.1a18 is a bit outdated. Could you upload a newer version to PyPI?

I only need this commit, but since a long time has passed, I think most of the changes on master are stable.

Revamp TTL APIs for more complex use cases

  • Simplify multi-tag handling (e.g. sense candidates, chunk languages, annotators, etc.)
  • Built-in support for CoNLL
  • Use the first tag slot for scalar tags (e.g. POS, lemma, surface, languages)
  • Re-design TTL JSON

FileNotFoundError [WinError 2] The system cannot find the file specified

>>> from chirptext import deko
>>> sent = deko.parse('猫が好きです。')
>>> sent.tokens
===========================
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-15-bd2eed5e1672> in <module>
      1 from chirptext import deko
----> 2 sent = deko.parse('猫が好きです。')
      3 sent.tokens

~\Anaconda3\lib\site-packages\chirptext\deko.py in txt2mecab(text, **kwargs)
    250 def txt2mecab(text, **kwargs):
    251     ''' Use mecab to parse one sentence '''
--> 252     mecab_out = _internal_mecab_parse(text, **kwargs).splitlines()
    253     tokens = [MeCabToken.parse(x) for x in mecab_out]
    254     return MeCabSent(text, tokens)

~\Anaconda3\lib\site-packages\chirptext\dekomecab.py in parse(content, *args, **kwargs)
     65         return MeCab.Tagger(*args).parse(content)
     66     else:
---> 67         return run_mecab_process(content, *args, **kwargs)
     68 
     69 

~\Anaconda3\lib\site-packages\chirptext\dekomecab.py in run_mecab_process(content, *args, **kwargs)
     55     output = subprocess.run(proc_args,
     56                             input=content.encode(encoding),
---> 57                             stdout=subprocess.PIPE)
     58     output_string = os.linesep.join(output.stdout.decode(encoding).splitlines())
     59     return output_string

~\Anaconda3\lib\subprocess.py in run(input, capture_output, timeout, check, *popenargs, **kwargs)
    464         kwargs['stderr'] = PIPE
    465 
--> 466     with Popen(*popenargs, **kwargs) as process:
    467         try:
    468             stdout, stderr = process.communicate(input, timeout=timeout)

~\Anaconda3\lib\subprocess.py in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, encoding, errors, text)
    767                                 c2pread, c2pwrite,
    768                                 errread, errwrite,
--> 769                                 restore_signals, start_new_session)
    770         except:
    771             # Cleanup if the child failed starting.

~\Anaconda3\lib\subprocess.py in _execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, unused_restore_signals, unused_start_new_session)
   1170                                          env,
   1171                                          os.fspath(cwd) if cwd is not None else None,
-> 1172                                          startupinfo)
   1173             finally:
   1174                 # Child is launched. Close the parent's copy of those pipe

FileNotFoundError: [WinError 2] The system cannot find the file specified
