ikawaha / kagome

Self-contained Japanese Morphological Analyzer written in pure Go

License: MIT License

Go 95.90% Dockerfile 0.03% HTML 4.05% Procfile 0.02%
japanese tokenizer nlp-library japanese-language pos-tagging segmentation morphological-analysis korean hacktoberfest

kagome's Introduction


Kagome v2

Kagome is an open source Japanese morphological analyzer written in pure Go.

Dictionary/statistical models such as MeCab-IPADIC and UniDic (unidic-mecab) can be embedded in the binary.

Improvements from v1:

  • Dictionaries are maintained in a separate repository, and only the dictionaries you need are embedded in the binary.
  • Brushed up and added several APIs.

Dictionaries

dict         | source                      | package
MeCab IPADIC | mecab-ipadic-2.7.0-20070801 | github.com/ikawaha/kagome-dict/ipa
UniDic       | unidic-mecab-2.1.2_src      | github.com/ikawaha/kagome-dict/uni

Note: IPADIC is MeCab's so-called "standard dictionary" and is characterized by its ability to split morphological units more intuitively than UniDic. In contrast, UniDic splits phrases into smaller units, which makes it well suited to creating metadata for full-text search. For more details, see the wiki.
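
The embedded dictionary is chosen at build time simply by importing the corresponding package. A minimal sketch, assuming the uni package exposes a Dict() constructor analogous to ipa.Dict() (which is how the kagome-dict packages are laid out); switching from IPADIC to UniDic only changes the import and the dictionary passed to the tokenizer:

package main

import (
	"fmt"

	"github.com/ikawaha/kagome-dict/uni"
	"github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
	// uni.Dict() is assumed to mirror ipa.Dict() in the kagome-dict packages.
	t, err := tokenizer.New(uni.Dict(), tokenizer.OmitBosEos())
	if err != nil {
		panic(err)
	}
	fmt.Println(t.Wakati("日本経済新聞"))
}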

Experimental Features

dict                 | source                      | package
mecab-ipadic-NEologd | mecab-ipadic-neologd        | github.com/ikawaha/kagome-ipa-neologd
Korean MeCab         | mecab-ko-dic-2.1.1-20180720 | github.com/ikawaha/kagome-dict-ko

Segmentation mode for search

Kagome provides segmentation modes for search, similar to Kuromoji:

  • Normal: Regular segmentation
  • Search: Use a heuristic to do additional segmentation useful for search
  • Extended: Similar to search mode, but also uni-gram unknown words
Untokenized | Normal | Search | Extended
関西国際空港 | 関西国際空港 | 関西 国際 空港 | 関西 国際 空港
日本経済新聞 | 日本経済新聞 | 日本 経済 新聞 | 日本 経済 新聞
シニアソフトウェアエンジニア | シニアソフトウェアエンジニア | シニア ソフトウェア エンジニア | シニア ソフトウェア エンジニア
デジカメを買った | デジカメ を 買っ た | デジカメ を 買っ た | デ ジ カ メ を 買っ た
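
The table above can be reproduced programmatically with Analyze, which takes a segmentation mode (the same Analyze call appears in the CLI code quoted further down this page). A minimal sketch, assuming the v2 mode constants tokenizer.Normal, tokenizer.Search, and tokenizer.Extended:

package main

import (
	"fmt"
	"strings"

	"github.com/ikawaha/kagome-dict/ipa"
	"github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
	t, err := tokenizer.New(ipa.Dict(), tokenizer.OmitBosEos())
	if err != nil {
		panic(err)
	}
	// The mode constants are assumed to be tokenizer.Normal, tokenizer.Search,
	// and tokenizer.Extended, matching the mode names used by the CLI.
	for _, m := range []struct {
		name string
		mode tokenizer.TokenizeMode
	}{
		{"normal", tokenizer.Normal},
		{"search", tokenizer.Search},
		{"extended", tokenizer.Extended},
	} {
		tokens := t.Analyze("関西国際空港", m.mode)
		surfaces := make([]string, 0, len(tokens))
		for _, tok := range tokens {
			surfaces = append(surfaces, tok.Surface)
		}
		fmt.Printf("%-8s %s\n", m.name, strings.Join(surfaces, " "))
	}
}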

Programming example

package main

import (
	"fmt"
	"strings"

	"github.com/ikawaha/kagome-dict/ipa"
	"github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
	t, err := tokenizer.New(ipa.Dict(), tokenizer.OmitBosEos())
	if err != nil {
		panic(err)
	}
	// wakati
	fmt.Println("---wakati---")
	seg := t.Wakati("すもももももももものうち")
	fmt.Println(seg)

	// tokenize
	fmt.Println("---tokenize---")
	tokens := t.Tokenize("すもももももももものうち")
	for _, token := range tokens {
		features := strings.Join(token.Features(), ",")
		fmt.Printf("%s\t%v\n", token.Surface, features)
	}
}

output:

---wakati---
[すもも も もも も もも の うち]
---tokenize---
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ

Reference

実践:形態素解析 kagome v2 (Practical Morphological Analysis with kagome v2)

Commands

Install

  • Go

    go install github.com/ikawaha/kagome/v2@latest
  • Homebrew

    # macOS and Linux (for both AMD64 and ARM64)
    brew install ikawaha/kagome/kagome
  • Docker (see the Docker section below)

  • Manual Install

    • For manual installation, download and extract the appropriate archived file for your OS and architecture from the releases page.
    • Note that the extracted binary must be placed in an accessible directory with execution permission.

Usage

$ kagome -h
Japanese Morphological Analyzer -- github.com/ikawaha/kagome/v2
usage: kagome <command>
The commands are:
   [tokenize] - command line tokenize (*default)
   server - run tokenize server
   lattice - lattice viewer
   sentence - tiny sentence splitter
   version - show version

tokenize [-file input_file] [-dict dic_file] [-userdict user_dic_file] [-sysdict (ipa|uni)] [-simple false] [-mode (normal|search|extended)] [-split] [-json]
  -dict string
    	dict
  -file string
    	input file
  -json
    	outputs in JSON format
  -mode string
    	tokenize mode (normal|search|extended) (default "normal")
  -simple
    	display abbreviated dictionary contents
  -split
    	use tiny sentence splitter
  -sysdict string
    	system dict type (ipa|uni) (default "ipa")
  -udict string
    	user dict

Tokenize command

% # interactive/REPL mode
% kagome
すもももももももものうち
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS
% # piped standard input
echo "すもももももももものうち" | kagome
すもも  名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も      助詞,係助詞,*,*,*,*,も,モ,モ
もも    名詞,一般,*,*,*,*,もも,モモ,モモ
も      助詞,係助詞,*,*,*,*,も,モ,モ
もも    名詞,一般,*,*,*,*,もも,モモ,モモ
の      助詞,連体化,*,*,*,*,の,ノ,ノ
うち    名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS
% # JSON output
% echo "" | kagome -json | jq .
[
  {
    "id": 286994,
    "start": 0,
    "end": 1,
    "surface": "猫",
    "class": "KNOWN",
    "pos": [
      "名詞",
      "一般",
      "*",
      "*"
    ],
    "base_form": "猫",
    "reading": "ネコ",
    "pronunciation": "ネコ",
    "features": [
      "名詞",
      "一般",
      "*",
      "*",
      "*",
      "*",
      "猫",
      "ネコ",
      "ネコ"
    ]
  }
]
echo "私ははにわよわわわんわん" | kagome -json | jq -r '.[].pronunciation'
ワタシ

ハニワ



ワンワン

Server command

API

Start a server and try to access the "/tokenize" endpoint.

% kagome server &
% curl -XPUT localhost:6060/tokenize -d'{"sentence":"すもももももももものうち", "mode":"normal"}' | jq .
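
The same request can be made from Go with the standard library alone. A minimal sketch, assuming the server is running locally on port 6060 as above; the response body is printed verbatim rather than decoded, since only the request shape is documented here:

package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

func main() {
	// Same PUT request as the curl example above.
	body := strings.NewReader(`{"sentence":"すもももももももものうち", "mode":"normal"}`)
	req, err := http.NewRequest(http.MethodPut, "http://localhost:6060/tokenize", body)
	if err != nil {
		panic(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	b, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}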

Web App


GitHub Page: https://ikawaha.github.io/kagome/

Start a server and access http://localhost:6060. (To draw a lattice, the demo application uses Graphviz. You need Graphviz installed.)

% kagome server &

Lattice command

A debugging tool for the tokenization process; it outputs a lattice in Graphviz dot format.

% kagome lattice 私は鰻 | dot -Tpng -o lattice.png


Docker

# Compatible architectures: AMD64, Arm64, Arm32 (Arm v5, v6 and v7)
docker pull ikawaha/kagome:latest
# Interactive/REPL mode
docker run --rm -it ikawaha/kagome:latest
# Server mode (http://localhost:6060)
docker run --rm -p 6060:6060 ikawaha/kagome:latest server

Building to WebAssembly

You can see how the kagome WebAssembly build works on the demo site. The source code can be found in ./sample/wasm.
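
A WebAssembly build follows the standard Go toolchain steps. The output file name and the wasm_exec.js location below are the usual Go conventions rather than anything kagome-specific:

# Standard Go WebAssembly build steps; output name and wasm_exec.js location
# are toolchain conventions, not kagome-specific.
GOOS=js GOARCH=wasm go build -o kagome.wasm ./sample/wasm
cp "$(go env GOROOT)/misc/wasm/wasm_exec.js" .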

Licence

MIT

kagome's People

Contributors

anniezhou08, deepsource-autofix[bot], dependabot[bot], dictav, hiroara, hurutoriya, ichiban, ikawaha, ingtk, kamatari, keinos, mattn, nakagami, nii236, sztheory, theoremoon


kagome's Issues

JSON output option for tokenize command

I wish to have a -json option to enhance command-line usage, letting the output be in JSON.

$ # Default
$ echo "私は鰻" | kagome
私      名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ
は      助詞,係助詞,*,*,*,*,は,ハ,ワ
鰻      名詞,一般,*,*,*,*,鰻,ウナギ,ウナギ
EOS

$ # With -json option
$ echo "私は鰻" | kagome -json
[
{"surface":"BOS","features":null},
{"surface":"私","features":["名詞","代名詞","一般","*","*","*","私","ワタシ","ワタシ"]},
{"surface":"は","features":["助詞","係助詞","*","*","*","*","は","ハ","ワ"]},
{"surface":"鰻","features":["名詞","一般","*","*","*","*","鰻","ウナギ","ウナギ"]},
{"surface":"EOS","features":null}
]
  • Use cases
$ # Cooperate with other commands
$ echo "私は鰻" | go run . -json | jq .
[
  {
    "surface": "BOS",
    "features": null
  },
  {
    "surface": "私",
    "features": [
      "名詞",
      "代名詞",
      "一般",
      "*",
      "*",
      "*",
      "私",
      "ワタシ",
      "ワタシ"
    ]
  },
  {
    "surface": "は",
    "features": [
      "助詞",
      "係助詞",
      "*",
      "*",
      "*",
      "*",
      "は",
      "ハ",
      "ワ"
    ]
  },
  {
    "surface": "鰻",
    "features": [
      "名詞",
      "一般",
      "*",
      "*",
      "*",
      "*",
      "鰻",
      "ウナギ",
      "ウナギ"
    ]
  },
  {
    "surface": "EOS",
    "features": null
  }
]

$ echo "私は鰻" | go run . -json | jq -r '.[].features[8] | select(. != null)'
ワタシ

ウナギ

$ # TTS for example
$ echo "私は鰻" | go run . -json | jq -r '.[].features[8] | select(. != null)' | say

Sample implementation
  • kagome/cmd/tokenize/cmd.go

    Lines 158 to 171 in 92102e0

    for s.Scan() {
        sen := s.Text()
        tokens := t.Analyze(sen, mode)
        for i, size := 1, len(tokens); i < size; i++ {
            tok := tokens[i]
            c := tok.Features()
            if tok.Class == tokenizer.DUMMY {
                fmt.Printf("%s\n", tok.Surface)
            } else {
                fmt.Printf("%s\t%v\n", tok.Surface, strings.Join(c, ","))
            }
        }
    }
    return s.Err()
+	# TODO: Capture option flag
+	var flagJSONIsUp = true

-	for s.Scan() {
-		sen := s.Text()
-		tokens := t.Analyze(sen, mode)
-		for i, size := 1, len(tokens); i < size; i++ {
-			tok := tokens[i]
-			c := tok.Features()
-			if tok.Class == tokenizer.DUMMY {
-				fmt.Printf("%s\n", tok.Surface)
-			} else {
-				fmt.Printf("%s\t%v\n", tok.Surface, strings.Join(c, ","))
-			}
-		}
-	}
-	return s.Err()

+	return ScanTokens(s, t, mode, flagJSONIsUp)
  • kagome/cmd/tokenize/PrintToken.go
package tokenize

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"

	"github.com/ikawaha/kagome/v2/tokenizer"
)

type TokenedJSON struct {
	Surface  string   `json:"surface"`
	Features []string `json:"features"`
}

func parseToJSON(surface string, features []string) ([]byte, error) {
	return json.Marshal(TokenedJSON{
		Surface:  surface,
		Features: features,
	})
}

func PrintDefault(s *bufio.Scanner, t *tokenizer.Tokenizer, mode tokenizer.TokenizeMode) error {
	for s.Scan() {
		sen := s.Text()
		tokens := t.Analyze(sen, mode)

		for i, size := 1, len(tokens); i < size; i++ {
			tok := tokens[i]
			c := tok.Features()
			if tok.Class == tokenizer.DUMMY {
				fmt.Printf("%s\n", tok.Surface)
			} else {
				fmt.Printf("%s\t%v\n", tok.Surface, strings.Join(c, ","))
			}
		}
	}

	return s.Err()
}

func PrintJSON(s *bufio.Scanner, t *tokenizer.Tokenizer, mode tokenizer.TokenizeMode) (err error) {
	var buff []byte

	fmt.Println("[") // Begin bracket

	for s.Scan() {
		sen := s.Text()
		tokens := t.Analyze(sen, mode)

		for _, tok := range tokens {
			c := tok.Features()

			if len(buff) > 0 {
				fmt.Printf("%s,\n", buff) // Print with comma
			}

			if buff, err = parseToJSON(tok.Surface, c); err != nil {
				return err
			}

		}
	}

	if s.Err() == nil {
		fmt.Printf("%s\n", buff) // Spit the last buffer w/no comma
		fmt.Println("]")         // End bracket
	}

	return s.Err()

}

func ScanTokens(s *bufio.Scanner, t *tokenizer.Tokenizer, mode tokenizer.TokenizeMode, jsonOut bool) error {
	if !jsonOut {
		return PrintDefault(s, t, mode)
	}

	return PrintJSON(s, t, mode)
}

User dictionary

To provide more flexibility, PR #72 enables loading a user dictionary from an io.Reader.
It doesn't break any compatibility. It seems good.

Looking ahead, I want to build a user dictionary from data such as JSON.
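
As a rough sketch of that idea: the simple user-dictionary format is plain CSV (surface, space-separated segments, space-separated readings, part-of-speech label; compare the 朝顔 entry later on this page), so JSON records can be converted with the standard library alone. The entry type and its JSON field names below are hypothetical:

package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// entry is a hypothetical JSON shape for a user-dictionary record.
type entry struct {
	Surface  string   `json:"surface"`
	Segments []string `json:"segments"`
	Readings []string `json:"readings"`
	POS      string   `json:"pos"`
}

func main() {
	src := `[{"surface":"日本経済新聞","segments":["日本","経済","新聞"],"readings":["ニホン","ケイザイ","シンブン"],"pos":"カスタム名詞"}]`
	var entries []entry
	if err := json.Unmarshal([]byte(src), &entries); err != nil {
		panic(err)
	}
	// Emit lines in the simple user-dictionary CSV format:
	// surface,space-separated segments,space-separated readings,POS
	for _, e := range entries {
		fmt.Printf("%s,%s,%s,%s\n",
			e.Surface, strings.Join(e.Segments, " "), strings.Join(e.Readings, " "), e.POS)
	}
}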

Add working examples

See #277 (comment).

./sample/
├── dict
│   └── userdict.txt
├── example      ← ※ folder for adding working examples
└── wasm
    ├── README.md
    ├── go.mod
    ├── kagome.html
    └── main.go

Long processing time when entering a large amount of text

When entering very long sentences into kagome, the processing time increases in proportion to the length of the sentence.
This problem does not seem to occur with MeCab.

>go version                                                                                                              ✘ 2 
go version go1.16.3 darwin/amd64
> time echo GOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO | kagome
GOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 名詞,固有名詞,組織,*,*,*,*
EOS
echo   0.00s user 0.00s system 43% cpu 0.003 total
kagome  16.89s user 0.19s system 101% cpu 16.781 total
> mecab -v
mecab of 0.996


> time echo GOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO | mecab
G       名詞,固有名詞,組織,*,*,*,*
O       名詞,一般,*,*,*,*,*
O       名詞,一般,*,*,*,*,*
~~ snip ~~
O       名詞,一般,*,*,*,*,*
O       名詞,一般,*,*,*,*,*
O       名詞,一般,*,*,*,*,*
OOOOOOOOOOOOOOOOOOOOOOOOO       名詞,固有名詞,組織,*,*,*,*
EOS
echo   0.00s user 0.00s system 28% cpu 0.002 total
mecab  0.00s user 0.00s system 26% cpu 0.023 total

Build error on Google App Engine

Thank you for the great library. It is perfect to solve my challenge.

When I build a GAE app which depends on kagome, it fails with a newContents redeclaration error:

kagome/internal/dic/content_appengine.go:23:32: newContents redeclared in this block

This is because both content.go and content_appengine.go declare newContents under the appengine build tag.

I guess there's no reason to separate content_appengine.go anymore since this commit: 597ca21.

JSON array goes wrong on interactive mode

As of v2.6.0, the opening bracket is printed before any interaction, and the last token of each sentence is missing, like so:

$ kagome version
2.6.0
   ipa v1.0.3
   uni v1.1.2

$ kagome -json
[
すもももももももものうち
{"id":36163,"start":0,"end":3,"surface":"すもも","class":"KNOWN","pos":["名詞","一般","*","*"],"base_form":"すもも","reading":"スモモ","pronunciation":"スモモ","features":["名詞","一般","*","*","*","*","すもも","スモモ","スモモ"]},
{"id":73244,"start":3,"end":4,"surface":"も","class":"KNOWN","pos":["助詞","係助詞","*","*"],"base_form":"も","reading":"モ","pronunciation":"モ","features":["助詞","係助詞","*","*","*","*","も","モ","モ"]},
{"id":74988,"start":4,"end":6,"surface":"もも","class":"KNOWN","pos":["名詞","一般","*","*"],"base_form":"もも","reading":"モモ","pronunciation":"モモ","features":["名詞","一般","*","*","*","*","もも","モモ","モモ"]},
{"id":73244,"start":6,"end":7,"surface":"も","class":"KNOWN","pos":["助詞","係助詞","*","*"],"base_form":"も","reading":"モ","pronunciation":"モ","features":["助詞","係助詞","*","*","*","*","も","モ","モ"]},
{"id":74988,"start":7,"end":9,"surface":"もも","class":"KNOWN","pos":["名詞","一般","*","*"],"base_form":"もも","reading":"モモ","pronunciation":"モモ","features":["名詞","一般","*","*","*","*","もも","モモ","モモ"]},
{"id":55829,"start":9,"end":10,"surface":"の","class":"KNOWN","pos":["助詞","連体化","*","*"],"base_form":"の","reading":"ノ","pronunciation":"ノ","features":["助詞","連体化","*","*","*","*","の","ノ","ノ"]},
私は鰻
{"id":8027,"start":8,"end":10,"surface":"うち","class":"KNOWN","pos":["名詞","非自立","副詞可能","*"],"base_form":"うち","reading":"ウチ","pronunciation":"ウチ","features":["名詞","非自立","副詞可能","*","*","*","うち","ウチ","ウチ"]},
{"id":304999,"start":0,"end":1,"surface":"私","class":"KNOWN","pos":["名詞","代名詞","一般","*"],"base_form":"私","reading":"ワタシ","pronunciation":"ワタシ","features":["名詞","代名詞","一般","*","*","*","私","ワタシ","ワタシ"]},
{"id":57061,"start":1,"end":2,"surface":"は","class":"KNOWN","pos":["助詞","係助詞","*","*"],"base_form":"は","reading":"ハ","pronunciation":"ワ","features":["助詞","係助詞","*","*","*","*","は","ハ","ワ"]},
^Csignal: interrupt

Note the position of the [ bracket and the first token element of "私は鰻".

The last token "うち" (ID 8027) of the previous sentence "すもももももももものうち" appears there instead.

The expected behavior may be as below.

$ kagome -json
すもももももももものうち
[
{"id":36163,"start":0,"end":3,"surface":"すもも","class":"KNOWN","pos":["名詞","一般","*","*"],"base_form":"すもも","reading":"スモモ","pronunciation":"スモモ","features":["名詞","一般","*","*","*","*","すもも","スモモ","スモモ"]},
{"id":73244,"start":3,"end":4,"surface":"も","class":"KNOWN","pos":["助詞","係助詞","*","*"],"base_form":"も","reading":"モ","pronunciation":"モ","features":["助詞","係助詞","*","*","*","*","も","モ","モ"]},
{"id":74988,"start":4,"end":6,"surface":"もも","class":"KNOWN","pos":["名詞","一般","*","*"],"base_form":"もも","reading":"モモ","pronunciation":"モモ","features":["名詞","一般","*","*","*","*","もも","モモ","モモ"]},
{"id":73244,"start":6,"end":7,"surface":"も","class":"KNOWN","pos":["助詞","係助詞","*","*"],"base_form":"も","reading":"モ","pronunciation":"モ","features":["助詞","係助詞","*","*","*","*","も","モ","モ"]},
{"id":74988,"start":7,"end":9,"surface":"もも","class":"KNOWN","pos":["名詞","一般","*","*"],"base_form":"もも","reading":"モモ","pronunciation":"モモ","features":["名詞","一般","*","*","*","*","もも","モモ","モモ"]},
{"id":55829,"start":9,"end":10,"surface":"の","class":"KNOWN","pos":["助詞","連体化","*","*"],"base_form":"の","reading":"ノ","pronunciation":"ノ","features":["助詞","連体化","*","*","*","*","の","ノ","ノ"]},
{"id":8027,"start":10,"end":12,"surface":"うち","class":"KNOWN","pos":["名詞","非自立","副詞可能","*"],"base_form":"うち","reading":"ウチ","pronunciation":"ウチ","features":["名詞","非自立","副詞可能","*","*","*","うち","ウチ","ウチ"]}
]
私は鰻
[
{"id":304999,"start":0,"end":1,"surface":"私","class":"KNOWN","pos":["名詞","代名詞","一般","*"],"base_form":"私","reading":"ワタシ","pronunciation":"ワタシ","features":["名詞","代名詞","一般","*","*","*","私","ワタシ","ワタシ"]},
{"id":57061,"start":1,"end":2,"surface":"は","class":"KNOWN","pos":["助詞","係助詞","*","*"],"base_form":"は","reading":"ハ","pronunciation":"ワ","features":["助詞","係助詞","*","*","*","*","は","ハ","ワ"]},
{"id":387420,"start":2,"end":3,"surface":"鰻","class":"KNOWN","pos":["名詞","一般","*","*"],"base_form":"鰻","reading":"ウナギ","pronunciation":"ウナギ","features":["名詞","一般","*","*","*","*","鰻","ウナギ","ウナギ"]}
]
^Csignal: interrupt

This is my bad, sorry. Here's the fix and I will PR it A.S.A.P.

func printTokensInJSON(s *bufio.Scanner, t *tokenizer.Tokenizer, mode tokenizer.TokenizeMode) (err error) {
	var buff []byte

-	fmtPrintF("[\n") // Begin array bracket

	for s.Scan() {
+		fmtPrintF("[\n") // Begin array bracket

		sen := s.Text()
		tokens := t.Analyze(sen, mode)

		for _, tok := range tokens {
			if tok.ID == tokenizer.BosEosID {
				continue
			}

			if len(buff) > 0 {
				fmtPrintF("%s,\n", buff) // Print array element (JSON with comma)
			}

			if buff, err = parseTokenToJSON(tok); err != nil {
				return err
			}
		}

+		fmtPrintF("%s\n", buff) // Spit out the last buffer without comma to close the array
+		fmtPrintF("]\n")        // End array bracket
	}

-	if s.Err() == nil {
-		fmtPrintF("%s\n", buff) // Spit out the last buffer without comma to close the array
-		fmtPrintF("]\n")        // End array bracket
-	}

	return s.Err()
}

Version display support

[Feature request]

Not a strong request, but it would be nice if I could check the version of the locally installed Kagome.

$ kagome -version
Kagome v1.7.3 (build 9f4fc8a)

Kagome works great on:

  • macOS HighSierra (OS X 10.13.6, build 17G65)
  • Kagome : v1.7.3
  • $ go version : go version go1.10.3 darwin/amd64

Releasing the binaries in the assets (preparation for Homebrew)

Feature request

TL; DR

It would be nice if one could download the compiled binary from the releases page.

TS; DR

This is a feature request in preparation for making the Kagome command (binary) downloadable via Homebrew, the package manager for macOS, Linux, and Windows Subsystem for Linux (WSL2).

Not to compete or go against MeCab but I would love to see Kagome in Homebrew as well.

brew install kagome

To do so, we need a repo named homebrew-kagome in which to place the formula.

I can help make the formula, but first of all we need the official binaries to be released in the assets, in order to get the proper hash values required by the formula.

So, since we have a GitHub Action that tests on Linux, Windows, and macOS in the dispatch workflow, I think it's not that difficult to add go build steps for each OS and release the binaries.

I would like to hear your opinions. What do you think about it?

Refs

proposal: fluent accessibility for features

Thank you for providing a useful library. I want more fluent access to Token attributes to simplify the workflow.

For example:

func main() {
    t := tokenizer.New()
    tokens := t.Tokenize("寿司が食べたい。")
    for _, token := range tokens {
        if token.Class == tokenizer.DUMMY {
            continue
        }
        if len(token.Features()) >= 1 && token.Features()[0] == "名詞" {
            fmt.Printf("%v\n", token)
        }
    }
}

If Token.Pos() were implemented, it could be used like this:

func main() {
    t := tokenizer.New()
    tokens := t.Tokenize("寿司が食べたい。")
    for _, token := range tokens {
        // interface to access Pos, Yomi, or etc..
        if token.Features().Pos() == "名詞" {
            // ...
        }

        // or simply
        if token.Pos() == "名詞" {
            // ...
        }
    }
}

How about providing Pos() string, Token() []string, etc. on Token? Token.Features() []string should provide features for any dictionary class, so is it difficult to provide that interface? (I'm not familiar with morphological analyzers, so I'm not certain whether features can be handled the same way for UserDic, SysDic, and Unknown.)
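
Until such accessors exist, a thin helper over Features() (which is already part of the public API, as used above) gets most of the way there. A minimal sketch using the v2-style setup from the README example; the pos helper itself is hypothetical, not part of the library:

package main

import (
	"fmt"

	"github.com/ikawaha/kagome-dict/ipa"
	"github.com/ikawaha/kagome/v2/tokenizer"
)

// pos is a hypothetical helper wrapping Features(): it returns the first
// feature (the part of speech) or "" when no features are available.
func pos(token tokenizer.Token) string {
	if f := token.Features(); len(f) > 0 {
		return f[0]
	}
	return ""
}

func main() {
	t, err := tokenizer.New(ipa.Dict(), tokenizer.OmitBosEos())
	if err != nil {
		panic(err)
	}
	for _, token := range t.Tokenize("寿司が食べたい。") {
		if pos(token) == "名詞" {
			fmt.Println(token.Surface)
		}
	}
}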

[Question] Comparison to other morphological analyzers

I am very interested in this project and consider incorporating it into my application.

I am not sure if this is the right place to ask, but I am curious how this compares to other morphological analyzers like MeCab and JUMAN++. Do you have any comparisons or know of some?

Make Dockerfile compatible with ARM architecture

[Enhancement]

Happy New Year!

Currently (as of 3641b3f), the Dockerfile creates an image with an AMD64 (Intel-compatible) binary via the GOARCH=amd64 option.

kagome/Dockerfile

Lines 21 to 32 in 3641b3f

RUN apk --no-cache add git && \
    version_app=$(git describe --tag) && \
    echo "- Current git tag: ${version_app}" && \
    go version && \
    CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build \
        -a \
        -installsuffix cgo \
        --ldflags "-w -s -extldflags \"-static\" -X 'main.version=${version_app}'" \
        -o /go/bin/kagome \
        ./cmd/kagome && \
    echo '- Running tests ...' && \
    /go/bin/kagome version

It would be nice to provide a Dockerfile for ARM architectures as well, such as Raspberry Pi 3 (ARMv7) and Raspberry Pi Zero (ARMv6).

It seems that Go supports ARM architectures and has some success stories on Raspberry Pis.

If I may, as soon as I succeed in making one, I will PR the Dockerfile. But please keep your expectations low. ;-)
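
One possible route, sketched below and untested here (it is not the project's actual release setup), is Docker Buildx, which can cross-build and push a multi-architecture manifest in one step:

# Sketch only: cross-build and push a multi-architecture image with Docker Buildx.
# The image name and tag are illustrative.
docker buildx create --use
docker buildx build \
  --platform linux/amd64,linux/arm64,linux/arm/v7,linux/arm/v6 \
  -t ikawaha/kagome:latest \
  --push .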

Too much memory allocation in CommonPrefixSearch

Hi @ikawaha san. I am trying to run kagome in a long-running process, and I found some memory allocations in the tokenizer.

-> % go test -v ./tokenizer -run=^$ -bench=. -benchmem -benchtime=2s -memprofile=prof.mem | tee mem.0
PASS
BenchmarkAnalyzeNormal-8           20000            140704 ns/op           19179 B/op        581 allocs/op
BenchmarkAnalyzeSearch-8           20000            185009 ns/op           19178 B/op        581 allocs/op
BenchmarkAnalyzeExtended-8         20000            180695 ns/op           18240 B/op        582 allocs/op
ok      github.com/ikawaha/kagome/tokenizer     15.453s

Reading it with pprof:

-> % go tool pprof --alloc_space tokenizer.test prof.mem
Entering interactive mode (type "help" for commands)
(pprof) top
1699.93MB of 1711.81MB total (99.31%)
Dropped 19 nodes (cum <= 8.56MB)
Showing top 10 nodes out of 45 (cum >= 14.28MB)
      flat  flat%   sum%        cum   cum%
  796.54MB 46.53% 46.53%  1046.54MB 61.14%  github.com/ikawaha/kagome/internal/dic.IndexTable.CommonPrefixSearch
  406.58MB 23.75% 70.28%  1454.63MB 84.98%  github.com/ikawaha/kagome/tokenizer.Tokenizer.Analyze
     250MB 14.60% 84.89%      250MB 14.60%  github.com/ikawaha/kagome/internal/da.DoubleArray.CommonPrefixSearch
  114.91MB  6.71% 91.60%   114.91MB  6.71%  bytes.makeSlice
   59.99MB  3.50% 95.11%    59.99MB  3.50%  strings.genSplit
      47MB  2.75% 97.85%       47MB  2.75%  encoding/binary.Read
   10.34MB   0.6% 98.46%    31.84MB  1.86%  github.com/ikawaha/kagome/internal/da.Read
    8.98MB  0.52% 98.98%    68.97MB  4.03%  github.com/ikawaha/kagome/internal/dic.NewContents
    3.31MB  0.19% 99.17%    16.81MB  0.98%  github.com/ikawaha/kagome/internal/dic.LoadConnectionTable
    2.28MB  0.13% 99.31%    14.28MB  0.83%  github.com/ikawaha/kagome/internal/dic.LoadMorphSlice

And looking at the allocations on each line:

(pprof) list CommonPrefixSearch
Total: 1.67GB
ROUTINE ======================== github.com/ikawaha/kagome/internal/da.DoubleArray.CommonPrefixSearch in /Users/ke-suzuki/src/github.com/ikawaha/kagome/internal/da/da.go
     250MB      250MB (flat, cum) 14.60% of Total
         .          .    101:           if q >= bufLen || int(d[q].Check) != p {
         .          .    102:                   break
         .          .    103:           }
         .          .    104:           ahead := int(d[q].Base) + int(terminator)
         .          .    105:           if ahead < bufLen && int(d[ahead].Check) == q && int(d[ahead].Base) <= 0 {
     123MB      123MB    106:                   ids = append(ids, int(-d[ahead].Base))
     127MB      127MB    107:                   lens = append(lens, i+1)
         .          .    108:           }
         .          .    109:   }
         .          .    110:   return
         .          .    111:}
         .          .    112:
ROUTINE ======================== github.com/ikawaha/kagome/internal/dic.IndexTable.CommonPrefixSearch in /Users/ke-suzuki/src/github.com/ikawaha/kagome/internal/dic/index.go
  796.54MB     1.02GB (flat, cum) 61.14% of Total
         .          .     61:}
         .          .     62:
         .          .     63:// CommonPrefixSearch finds keywords sharing common prefix in an input
         .          .     64:// and returns the ids and it's lengths if found.
         .          .     65:func (idx IndexTable) CommonPrefixSearch(input string) (lens []int, ids [][]int) {
         .      250MB     66:   seeds, lens := idx.Da.CommonPrefixSearch(input)
         .          .     67:   for _, id := range seeds {
         .          .     68:           dup, _ := idx.Dup[int32(id)]
  351.02MB   351.02MB     69:           list := make([]int, 1+dup, 1+dup)
         .          .     70:           for i := 0; i < len(list); i++ {
         .          .     71:                   list[i] = id + i
         .          .     72:           }
  445.52MB   445.52MB     73:           ids = append(ids, list)
         .          .     74:   }
         .          .     75:   return
         .          .     76:}
         .          .     77:
         .          .     78:// Search finds the given keyword and returns the id if found.

To reduce allocations, it is preferable to append without creating a new slice. Do you have any ideas to reduce allocations in DoubleArray.CommonPrefixSearch and IndexTable.CommonPrefixSearch? For example,

func (d DoubleArray) CommonPrefixSearch(input string) (ids, lens []int) {
    var p, q int
    bufLen := len(d)
    ids = make([]int, 0, SOME_FIXED_SIZE)
    lens = make([]int, 0, SOME_FIXED_SIZE)
    for i, size := 0, len(input); i < size; i++ {
        p = q
        q = int(d[p].Base) + int(input[i])
        if q >= bufLen || int(d[q].Check) != p {
            break
        }
        ahead := int(d[q].Base) + int(terminator)
        if ahead < bufLen && int(d[ahead].Check) == q && int(d[ahead].Base) <= 0 {
            ids = append(ids, int(-d[ahead].Base))
            lens = append(lens, i+1)
        }
    }
    return
}

may reduce allocations in many cases, but I'm not sure how SOME_FIXED_SIZE should be defined.
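
Another option, sketched below against the same snippet (so it is not self-contained), is to let the caller pass in result slices that are truncated with [:0] and reused across calls; the backing arrays then grow only until they reach their high-water mark, after which further calls allocate nothing:

// CommonPrefixSearchTo is a sketch of an append-into-caller-buffer variant of
// the function above (it reuses the surrounding DoubleArray and terminator).
// The caller keeps ids and lens between calls; they are truncated here and
// reused, so allocations stop once their capacity is large enough.
func (d DoubleArray) CommonPrefixSearchTo(input string, ids, lens []int) ([]int, []int) {
	ids, lens = ids[:0], lens[:0]
	var p, q int
	bufLen := len(d)
	for i, size := 0, len(input); i < size; i++ {
		p = q
		q = int(d[p].Base) + int(input[i])
		if q >= bufLen || int(d[q].Check) != p {
			break
		}
		ahead := int(d[q].Base) + int(terminator)
		if ahead < bufLen && int(d[ahead].Check) == q && int(d[ahead].Base) <= 0 {
			ids = append(ids, int(-d[ahead].Base))
			lens = append(lens, i+1)
		}
	}
	return ids, lens
}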

Do not use unsafe

The code using unsafe does not seem to have much effect, so remove it.

Is there a way to get the readings for each character

Thanks again for this awesome project and your help!

I got started using kagome and now I am wondering if there is a way to get the readings of individual characters. For example:

input: 日本経済新聞
output: 日 - に; 本 - ほん;  経 - けい; 済 - ざい; 新 - しん; 聞 - ぶん; 

instead of

input: 日本経済新聞
output: にほんけいざいしんぶん

How to build kagome v2 on the web?

I want to host kagome v2 as a Korean and Japanese tokenizer on a GitHub Pages site.

I know the main file is in sample/demo.html...
Looking at the example, the "dic" options are Japanese and Chinese.
I want to change the example to Japanese and Korean.
How can I do it?

[homebrew] "bottle : unneeded is deprecated" warning

Not a bug in Kagome, but a brew warning caused by goreleaser.

As of Kagome v2.7.0 + Homebrew 3.3.1 combination, a warning "Calling bottle :unneeded is deprecated" appears.

$ brew info kagome
Warning: Calling bottle :unneeded is deprecated! There is no replacement.
Please report this issue to the ikawaha/kagome tap (not Homebrew/brew or Homebrew/core):
  /usr/local/Homebrew/Library/Taps/ikawaha/homebrew-kagome/Formula/kagome.rb:9

ikawaha/kagome/kagome: stable 2.7.0
Self-contained Japanese Morphological Analyzer written in pure Go.
https://github.com/ikawaha/kagome
/usr/local/Cellar/kagome/2.7.0 (5 files, 61.5MB) *
  Built from source on 2021-09-17 at 09:10:04
From: https://github.com/ikawaha/homebrew-kagome/blob/HEAD/Formula/kagome.rb

$ brew doctor
Warning: Calling bottle :unneeded is deprecated! There is no replacement.
Please report this issue to the ikawaha/kagome tap (not Homebrew/brew or Homebrew/core):
  /usr/local/Homebrew/Library/Taps/ikawaha/homebrew-kagome/Formula/kagome.rb:9

Your system is ready to brew.

It seems to be related to the goreleaser issue below, which was fixed in goreleaser v0.183.0.

So we probably just have to wait a little for the fix to be reflected in the goreleaser action.


Env info
$ brew --version
Homebrew 3.3.1
Homebrew/homebrew-core (git revision ec823dbf70f; last commit 2021-10-27)
Homebrew/homebrew-cask (git revision 876d3165c6; last commit 2021-10-27)

$ sw_vers
ProductName:	Mac OS X
ProductVersion:	10.15.7
BuildVersion:	19H1519

Too much memory allocation in case of Shift_JIS input

I'm using kagome to parse the text parts of HTML files (very useful, thanks!). Some files contain Shift_JIS characters, and kagome does not return output for these files (it consumes a lot of memory).

Sample input:

echo "日本語" | nkf -s | kagome

When using a user dictionary, how to split kanji

I'm using a user dictionary with this entry:

朝顔,朝 顔,あさ かお,あさ かお

I'm trying to split 朝顔 into 朝 and 顔, so that they come out as two different tokens.

How do I achieve this?

Thanks

romaji transliteration

Is it possible to use the command-line tool for interactive romaji transliteration?
e.g.

$ kagome -romaji
ローマ字変換プログラム作ってみた。
Roma ji henkan program tsukutte mita.

(same as cutlet)

Dependabot can't resolve your Go dependency files

Dependabot can't resolve your Go dependency files.

As a result, Dependabot couldn't update your dependencies.

The error Dependabot encountered was:


If you think the above is an error on Dependabot's side please don't hesitate to get in touch - we'll do whatever we can to fix it.

View the update logs.

[GitHub Actions] Docker build failed

https://github.com/ikawaha/kagome/actions/runs/4335647896/jobs/7570442414

Pulling the images back failed:

Run docker system prune -f -a && \
  docker system prune -f -a && \
  docker pull ***/kagome:arm32v6 && \
  docker pull ***/kagome:arm32v7 && \
  docker pull ***/kagome:arm64 && \
  docker pull ***/kagome:amd64
  shell: /usr/bin/bash -e {0}
  env:
    NAME_IMAGE: ***/kagome
Total reclaimed space: 0B
no matching manifest for linux/amd64 in the manifest list entries
arm32v6: Pulling from ***/kagome
Error: Process completed with exit code 1.

