megagonlabs / bunkai


Sentence boundary disambiguation tool for Japanese texts (日本語文境界判定器)

Home Page: https://pypi.org/project/bunkai/

License: Apache License 2.0

Python 98.34% Makefile 1.66%
python sentence-tokenizer sentence-boundary-detection japanese

bunkai's Introduction

Bunkai

Bunkai is a sentence boundary (SB) disambiguation tool for Japanese texts.

Quick Start

Install

$ pip install -U bunkai

Disambiguation without Models

$ echo -e '宿を予約しました♪!まだ2ヶ月も先だけど。早すぎかな(笑)楽しみです★\n2文書目の先頭行です。▁改行はU+2581で表現します。' \
    | bunkai
宿を予約しました♪!│まだ2ヶ月も先だけど。│早すぎかな(笑)│楽しみです★
2文書目の先頭行です。▁│改行はU+2581で表現します。
  • Feed each document as a single line, using ▁ (U+2581) in place of line breaks. One input line represents one document.
  • The output marks sentence boundaries with │ (U+2502).
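
The two markers described above can be handled with plain string operations. The following is a minimal sketch, not part of Bunkai itself: the helper names `encode_document` and `split_output` are hypothetical, and the sketch only assumes the input and output conventions shown in the example above.

```python
from typing import List

LINE_BREAK = "\u2581"  # ▁ stands in for a line break inside a document
BOUNDARY = "\u2502"    # │ marks a sentence boundary in the output


def encode_document(text: str) -> str:
    """Collapse a multi-line document into the one-line input form (hypothetical helper)."""
    return text.replace("\r\n", "\n").replace("\n", LINE_BREAK)


def split_output(line: str) -> List[str]:
    """Split one output line into sentences at the boundary marker (hypothetical helper)."""
    return [part for part in line.rstrip("\n").split(BOUNDARY) if part]


encoded = encode_document("2文書目の先頭行です。\n改行はU+2581で表現します。")
sentences = split_output("2文書目の先頭行です。▁│改行はU+2581で表現します。")
```

Here `encoded` is a single line suitable for piping into the command, and `sentences` keeps the ▁ marker at the end of the first sentence, as in the output above.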

Disambiguation for Line Breaks with a Model

To disambiguate sentence boundaries at line breaks as well, add the --model option with the path to a model.

First, install the extras required by the --model option.

$ pip install -U 'bunkai[lb]'

Second, set up a model. This takes some time.

$ bunkai --model bunkai-model-directory --setup

Then run bunkai with the model directory specified.

$ echo -e "文の途中で改行を▁入れる文章ってありますよね▁それも対象です。" | bunkai --model bunkai-model-directory
文の途中で改行を▁入れる文章ってありますよね▁│それも対象です。

Morphological Analysis Result

The --ma option outputs morphological analysis results.

It can be combined with the --model option.

$ echo -e '形態素解析し▁ます。結果を 表示します!' | bunkai --ma --model bunkai-model-directory
形態素	名詞,一般,*,*,*,*,形態素,ケイタイソ,ケイタイソ
解析	名詞,サ変接続,*,*,*,*,解析,カイセキ,カイセキ
し	動詞,自立,*,*,サ変・スル,連用形,する,シ,シ

ます	助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。	記号,句点,*,*,*,*,。,。,。
EOS
結果	名詞,副詞可能,*,*,*,*,結果,ケッカ,ケッカ
を	助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
 	記号,空白,*,*,*,*, ,*,*
表示	名詞,サ変接続,*,*,*,*,表示,ヒョウジ,ヒョージ
し	動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
ます	助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
!	記号,一般,*,*,*,*,!,!,!
EOS
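
The output above can be post-processed in Python. This is a minimal sketch under assumptions: the function name `parse_ma_output` is hypothetical, each morpheme line is taken to be a surface form and a comma-separated feature list separated by a tab (as it appears above), and `EOS` is taken to end a sentence.

```python
from typing import List, Tuple

Morpheme = Tuple[str, List[str]]  # (surface form, list of features)


def parse_ma_output(output: str) -> List[List[Morpheme]]:
    """Group morpheme lines into sentences, using EOS as the delimiter (hypothetical helper)."""
    sentences: List[List[Morpheme]] = []
    current: List[Morpheme] = []
    for line in output.split("\n"):
        if line == "EOS":
            sentences.append(current)
            current = []
        elif "\t" in line:
            surface, features = line.split("\t", 1)
            current.append((surface, features.split(",")))
        # Lines without a tab (for example blank lines) are skipped.
    if current:
        sentences.append(current)
    return sentences


sample = (
    "結果\t名詞,副詞可能,*,*,*,*,結果,ケッカ,ケッカ\n"
    "を\t助詞,格助詞,一般,*,*,*,を,ヲ,ヲ\n"
    "EOS\n"
    "表示\t名詞,サ変接続,*,*,*,*,表示,ヒョウジ,ヒョージ\n"
    "EOS\n"
)
parsed = parse_ma_output(sample)
surfaces = [[surface for surface, _ in sentence] for sentence in parsed]
```

Splitting the feature string on every comma is a simplification; a surface or feature that itself contains a comma would need more careful handling.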

Python Library

You can also use Bunkai as a Python library.

from bunkai import Bunkai
bunkai = Bunkai()
for sentence in bunkai("はい。このようにpythonライブラリとしても使えます!"):
    print(sentence)

To disambiguate line breaks as well, pass the path of the model you set up.

from pathlib import Path

from bunkai import Bunkai

bunkai = Bunkai(path_model=Path("bunkai-model-directory"))
for sentence in bunkai("そうなんです▁このように▁pythonライブラリとしても▁使えます!"):
    print(sentence)

"""
Output:
そうなんです▁
このように▁pythonライブラリとしても▁使えます!
"""
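
If the source text contains real newlines, they can be converted to ▁ before the call and restored afterwards. The following is a minimal sketch, not Bunkai's own API: `split_with_newlines` is a hypothetical wrapper around any callable that takes a string and yields sentences, and `stub_splitter` is a stand-in used so that the example runs without a model.

```python
from typing import Callable, Iterable, List

LINE_BREAK = "\u2581"  # ▁


def split_with_newlines(splitter: Callable[[str], Iterable[str]], text: str) -> List[str]:
    """Encode newlines as ▁, split, then restore newlines in each sentence (hypothetical wrapper)."""
    encoded = text.replace("\n", LINE_BREAK)
    return [sentence.replace(LINE_BREAK, "\n") for sentence in splitter(encoded)]


def stub_splitter(text: str) -> List[str]:
    """Stand-in splitter for illustration only: cuts once, after the first ▁."""
    head, sep, tail = text.partition(LINE_BREAK)
    return [head + sep, tail] if tail else [head + sep]


result = split_with_newlines(
    stub_splitter,
    "そうなんです\nこのように\npythonライブラリとしても\n使えます!",
)
```

Since the examples above call a `Bunkai` instance directly with a string and iterate over the result, passing such an instance as `splitter` should fit the same shape, though that has not been verified here.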

For more information, see the examples.

References

  • Yuta Hayashibe and Kensuke Mitsuzawa. Sentence Boundary Detection on Line Breaks in Japanese. Proceedings of The 6th Workshop on Noisy User-generated Text (W-NUT 2020), pp.71-75. November 2020.

License

Apache License 2.0

bunkai's People

Contributors

dependabot[bot], r-terada, shirayu, t-yamamura

bunkai's Issues

AttributeError: 'JanomeSubwordsTokenizer' object has no attribute 'vocab'

I get the following error when following the README. I hope the steps and environment are clear from the log below; let me know if you need more info.

Python 3.11.6 (main, Nov 2 2023, 04:39:40) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from bunkai import Bunkai
>>> from pathlib import Path
>>> bunkai=Bunkai(path_model=Path('bunkai-model-directory'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/homebrew/lib/python3.11/site-packages/bunkai/algorithm/bunkai_sbd/bunkai_sbd.py", line 78, in __init__
    _annotators.insert(_idxs[0] + 1, LinebreakAnnotator(path_model=path_model))
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/bunkai/algorithm/bunkai_sbd/annotator/linebreak_annotator.py", line 16, in __init__
    self.linebreak_detector = Predictor(modelpath=path_model)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/bunkai/algorithm/lbd/predict.py", line 43, in __init__
    self.tokenizer = JanomeSubwordsTokenizer(self.path_tokenizer_model)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/bunkai/algorithm/lbd/custom_tokenizers.py", line 136, in __init__
    super(BertTokenizer, self).__init__(
  File "/opt/homebrew/lib/python3.11/site-packages/transformers/tokenization_utils.py", line 367, in __init__
    self._add_tokens(
  File "/opt/homebrew/lib/python3.11/site-packages/transformers/tokenization_utils.py", line 467, in _add_tokens
    current_vocab = self.get_vocab().copy()
                    ^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/transformers/models/bert/tokenization_bert.py", line 240, in get_vocab
    return dict(self.vocab, **self.added_tokens_encoder)
                ^^^^^^^^^^
AttributeError: 'JanomeSubwordsTokenizer' object has no attribute 'vocab'
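
The traceback shows the base-class initializer calling get_vocab(), which reads self.vocab, while the subclass initializer is still running. One way such an error can arise is an initialization-order problem, sketched below with hypothetical stand-in classes (these are not the real transformers or bunkai classes). Whether this matches the actual cause in bunkai is an assumption.

```python
class BaseTokenizer:
    """Hypothetical stand-in for a base class whose __init__ reads the vocabulary."""

    def __init__(self) -> None:
        # Mirrors the shape of the traceback: the base initializer calls get_vocab().
        self.current_vocab = self.get_vocab().copy()

    def get_vocab(self) -> dict:
        return dict(self.vocab)


class LateVocabTokenizer(BaseTokenizer):
    def __init__(self, vocab: dict) -> None:
        super().__init__()  # get_vocab() runs here, before self.vocab exists
        self.vocab = vocab


class EarlyVocabTokenizer(BaseTokenizer):
    def __init__(self, vocab: dict) -> None:
        self.vocab = vocab  # assign the attribute first
        super().__init__()


try:
    LateVocabTokenizer({"[UNK]": 0})
    error_message = ""
except AttributeError as exc:
    error_message = str(exc)

early = EarlyVocabTokenizer({"[UNK]": 0})
```

In this sketch, assigning the attribute before calling the base initializer avoids the error; any fix in a real library would depend on that library's code.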

KeyError in indirect_quote_exception_annotator.py

  File "/path/to/work/.venv/lib/python3.7/site-packages/bunkai/algorithm/bunkai_sbd/bunkai_sbd.py", line 89, in __call__
    annotations = self.eos(text)
  File "/path/to/work/.venv/lib/python3.7/site-packages/bunkai/algorithm/bunkai_sbd/bunkai_sbd.py", line 79, in eos
    rule_obj.annotate(text, annotations)
  File "/path/to/work/.venv/lib/python3.7/site-packages/bunkai/algorithm/bunkai_sbd/annotator/indirect_quote_exception_annotator.py", line 137, in annotate
    index2token_obj=index2token_obj):
  File "/path/to/work/.venv/lib/python3.7/site-packages/bunkai/algorithm/bunkai_sbd/annotator/indirect_quote_exception_annotator.py", line 101, in is_exception_particle
    for rule_object in MORPHEMES_AFTER_CANDIDATE]):
  File "/path/to/work/.venv/lib/python3.7/site-packages/bunkai/algorithm/bunkai_sbd/annotator/indirect_quote_exception_annotator.py", line 101, in <listcomp>
    for rule_object in MORPHEMES_AFTER_CANDIDATE]):
  File "/path/to/work/.venv/lib/python3.7/site-packages/bunkai/algorithm/bunkai_sbd/annotator/indirect_quote_exception_annotator.py", line 45, in is_rule_valid
    if self.rule_word_surface[i_rule_morpheme] != index2token_obj[__check].word_surface:
KeyError: 107747
