elasticsearch-analysis-vietnamese's Introduction

Vietnamese Analysis Plugin for Elasticsearch


The Vietnamese Analysis plugin integrates Vietnamese language analysis into Elasticsearch. It uses the C++ Vietnamese tokenizer library developed by the CocCoc team for their search engine and ads systems.

The plugin provides the vi_analyzer analyzer, the vi_tokenizer tokenizer, and the vi_stop stop filter. The vi_analyzer analyzer is composed of the vi_tokenizer tokenizer and the stop and lowercase filters.
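
Since those components are regular analysis building blocks, a roughly equivalent custom analyzer can be assembled from them in index settings. The following sketch uses placeholder index and analyzer names and assumes the vi_stop filter's default stop word list:

PUT my-vi-index-00000
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_vi_like_analyzer": {
          "tokenizer": "vi_tokenizer",
          "filter": [
            "lowercase",
            "vi_stop"
          ]
        }
      }
    }
  }
}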

Example output

GET _analyze
{
  "analyzer": "vi_analyzer",
  "text": "Cộng hòa Xã hội chủ nghĩa Việt Nam"
}

The above sentence would produce the following terms:

["cộng hòa", "xã hội", "chủ nghĩa" ,"việt nam"]

Configuration

The vi_analyzer analyzer accepts the following parameters:

  • dict_path The path to the tokenizer dictionary on the system. Defaults to /usr/local/share/tokenizer/dicts.

  • keep_punctuation Keep punctuation marks as tokens. Defaults to false.

  • split_url If enabled (true), a domain such as duydo.me is split into ["duy", "do", "me"]. If disabled (false), duydo.me is split into ["duydo", "me"]. Defaults to false.

  • stopwords A pre-defined stop words list like _vi_ or an array containing a list of stop words. Defaults to the stopwords.txt file.

  • stopwords_path The path to a file containing stop words (a combined example follows this list).
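
A combined sketch of these options, using placeholder values for the index name and the stop words file, might look like this:

PUT my-vi-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_vi_analyzer": {
          "type": "vi_analyzer",
          "dict_path": "/usr/local/share/tokenizer/dicts",
          "keep_punctuation": false,
          "split_url": true,
          "stopwords_path": "vi_stopwords.txt"
        }
      }
    }
  }
}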

Example configuration

In this example, we configure the vi_analyzer analyzer to keep punctuation marks and to use a custom list of stop words:

PUT my-vi-index-00001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_vi_analyzer": {
          "type": "vi_analyzer",
          "keep_punctuation": true,
          "stopwords": ["rất", "những"]
        }
      }
    }
  }
}

GET my-vi-index-00001/_analyze
{
  "analyzer": "my_vi_analyzer",
  "text": "Công nghệ thông tin Việt Nam rất phát triển trong những năm gần đây."
}

The above example produces the following terms:

["công nghệ", "thông tin", "việt nam", "phát triển", "trong", "năm", "gần đây", "."]

We can also create a custom analyzer with the vi_tokenizer tokenizer. In the following example, we create my_vi_analyzer to produce lowercase tokens both with and without diacritics:

PUT my-vi-index-00002
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_vi_analyzer": {
          "tokenizer": "vi_tokenizer",
          "filter": [
            "lowercase",
            "ascii_folding"
          ]
        }
      },
      "filter": {
        "ascii_folding": {
          "type": "asciifolding",
          "preserve_original": true
        }
      }
    }
  }
}

GET my-vi-index-00002/_analyze
{
  "analyzer": "my_vi_analyzer",
  "text": "Cộng hòa Xã hội chủ nghĩa Việt Nam"
}

The above example produces the following terms:

["cong hoa", "cộng hòa", "xa hoi", "xã hội", "chu nghia", "chủ nghĩa", "viet nam", "việt nam"]

Use Docker

Make sure you have installed both Docker and Docker Compose.

Build the image with Docker Compose

# Copy .env.sample to .env, then edit the Elasticsearch version and the password for the elastic user. Default password: changeme
cp .env.sample .env
docker compose build
docker compose up

Verify

curl -k http://elastic:changeme@localhost:9200/_analyze -H 'Content-Type: application/json' -d '
{
  "analyzer": "vi_analyzer",
  "text": "Cộng hòa Xã hội chủ nghĩa Việt Nam"
}'

# Output
{"tokens":[{"token":"cộng hòa","start_offset":0,"end_offset":8,"type":"<WORD>","position":0},{"token":"xã hội","start_offset":9,"end_offset":15,"type":"<WORD>","position":1},{"token":"chủ nghĩa","start_offset":16,"end_offset":25,"type":"<WORD>","position":2},{"token":"việt nam","start_offset":26,"end_offset":34,"type":"<WORD>","position":3}]}                                                                                     

Build from Source

Step 1: Build C++ tokenizer for Vietnamese library

git clone https://github.com/duydo/coccoc-tokenizer.git
cd coccoc-tokenizer && mkdir build && cd build
cmake -DBUILD_JAVA=1 ..
make install
# Link the coccoc shared lib to /usr/lib
sudo ln -sf /usr/local/lib/libcoccoc_tokenizer_jni.* /usr/lib/

By default, make install installs:

  • The command-line tools tokenizer, dict_compiler and vn_lang_tool under /usr/local/bin
  • The dynamic library libcoccoc_tokenizer_jni.so under /usr/local/lib/. The plugin uses this library directly.
  • The dictionary files under /usr/local/share/tokenizer/dicts. The plugin uses this path as the default dict_path.

Verify

/usr/local/bin/tokenizer "Cộng hòa Xã hội chủ nghĩa Việt Nam"
# cộng hòa	xã hội	chủ nghĩa	việt nam

Refer to the coccoc-tokenizer repository for more information on building the library.

Step 2: Build the plugin

Clone the plugin’s source code:

git clone https://github.com/duydo/elasticsearch-analysis-vietnamese.git

Optionally, edit elasticsearch-analysis-vietnamese/pom.xml to change the Elasticsearch version you want to build the plugin against (the plugin version matches the Elasticsearch version):

...
<version>8.7.0</version>
...

Build the plugin:

cd elasticsearch-analysis-vietnamese
mvn package
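
If the bundled tests fail in your environment (see the build-related issues reported below), you can still produce the plugin zip by skipping them; treat this as a workaround rather than the normal build:

mvn package -DskipTests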

Step 3: Install the plugin on Elasticsearch

bin/elasticsearch-plugin install file://target/releases/elasticsearch-analysis-vietnamese-8.7.0.zip
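
After installing, restart Elasticsearch and confirm the plugin is registered, for example:

bin/elasticsearch-plugin list
# or, once the node is running:
curl localhost:9200/_cat/plugins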

Compatible Versions

From v7.12.1, the plugin uses the CocCoc C++ tokenizer instead of the VnTokenizer by Lê Hồng Phương. I no longer maintain the plugin with the VnTokenizer; if you want to continue developing with it, refer to the vntokenizer branch.

Vietnamese Analysis Plugin | Elasticsearch
master  | 8.7.0
develop | 8.7.0
8.7.0   | 8.7.0
8.4.0   | 8.4.0 ~ 8.7.1
8.0.0   | 8.0.0 ~ 8.0.x
7.16.1  | 7.16 ~ 7.17.1
7.12.1  | 7.12.1 ~ 7.15.x
7.3.1   | 7.3.1
5.6.5   | 5.6.5
5.4.1   | 5.4.1
5.3.1   | 5.3.1
5.2.1   | 5.2.1
2.4.1   | 2.4.1
2.4.0   | 2.4.0
2.3.5   | 2.3.5
2.3.4   | 2.3.4
2.3.3   | 2.3.3
2.3.2   | 2.3.2
2.3.1   | 2.3.1
2.3.0   | 2.3.0
0.2.2   | 2.2.0
0.2.1.1 | 2.1.1
0.2.1   | 2.1.0
0.2     | 2.0.0
0.1.7   | 1.7+
0.1.6   | 1.6+
0.1.5   | 1.5+
0.1.1   | 1.4+
0.1     | 1.3

Issues

You might get the following errors when starting Elasticsearch with the plugin:

1. Error: java.lang.UnsatisfiedLinkError: no libcoccoc_tokenizer_jni in java.library.path ... (reported in #102)

This happens because the JVM cannot find the dynamic library libcoccoc_tokenizer_jni in java.library.path. Try one of the following options:

  • Appending /usr/local/lib to the LD_LIBRARY_PATH environment variable:
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
  • Making a symbolic link or copying the file /usr/local/lib/libcoccoc_tokenizer_jni.so to /usr/lib:
# Make link
ln -sf /usr/local/lib/libcoccoc_tokenizer_jni.so /usr/lib/libcoccoc_tokenizer_jni.so

# Copy 
cp /usr/local/lib/libcoccoc_tokenizer_jni.so /usr/lib
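
To see which directories the JVM actually searches, you can print java.library.path and check that it contains the directory holding libcoccoc_tokenizer_jni.so (this assumes a standard OpenJDK):

java -XshowSettings:properties -version 2>&1 | grep java.library.path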

2. Error: Cannot initialize Tokenizer: /usr/local/share/tokenizer/dicts (reported in #106)

This happens because the tokenizer cannot find the dictionary files under /usr/local/share/tokenizer/dicts. Ensure that path exists and contains these files: alphabetic, d_and_gi.txt, i_and_y.txt, multiterm_trie.dump, nontone_pair_freq_map.dump, numeric, syllable_trie.dump. If not, rebuild the C++ tokenizer (Step 1).
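
A quick check is to list the dictionary directory and compare it against the files above:

ls /usr/local/share/tokenizer/dicts
# All of the dictionary files listed above should be present.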


License

This software is licensed under the Apache 2 license, quoted below.

Copyright by Duy Do

Licensed under the Apache License, Version 2.0 (the "License"); you may not
use this file except in compliance with the License. You may obtain a copy of
the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations under
the License.

elasticsearch-analysis-vietnamese's People

Contributors

dependabot[bot], dungvo, duydo, hdhoang, imcvampire, kubeplusplus, nguyentienlong, taivc-teko, trantienduchn, tritoanst, tyrantkhan


elasticsearch-analysis-vietnamese's Issues

Cannot build a custom version of Elasticsearch

Hi @duydo!
I cloned your project, then tried to build it following this tutorial: http://duydo.me/how-to-build-elasticsearch-vietnamese-analysis-plugin/.
But I got two errors:

ERROR 0.00s | VietnameseAnalysisIntegrationTest (suite) <<<
Throwable #1: java.lang.RuntimeException: found jar hell in test classpath

ERROR 0.00s | VietnameseAnalysisTest.initializationError <<<
Throwable #1: java.lang.NoClassDefFoundError: org.elasticsearch.test.ESTestCase
Please tell me how to fix these problems. Thanks!

Elasticsearch 5.0

Hi Duy Do,
I got this issue when installing your plugin into Elasticsearch 5.0:
Exception in thread "main" java.net.UnknownHostException: usr
Please release a new version of the plugin that is compatible with Elasticsearch 5.
Thanks,
Mai Truong

Error when bulk indexing a database using vnTokenizer

I found this error in the log file when bulk indexing a database using the plugin. Any suggestions for debugging this?
ES 5.6.6, and plugin info:

qvB_lVV analysis-icu                      5.6.5
qvB_lVV elasticsearch-analysis-vietnamese 5.6.5
[2018-11-15T21:31:37,514][DEBUG][o.e.a.b.TransportShardBulkAction] [qvB_lVV] [lalafood_2018_11_16][0] failed to execute bulk item (index) BulkShardRequest [[lalafood_2018_11_16][0]] containing [21] requests
java.lang.StringIndexOutOfBoundsException: String index out of range: 8
	at java.lang.String.substring(String.java:1963) ~[?:1.8.0_181]
	at vn.hus.nlp.tokenizer.Tokenizer.getNextToken(Tokenizer.java:469) ~[?:?]
	at vn.hus.nlp.tokenizer.Tokenizer.tokenize(Tokenizer.java:214) ~[?:?]
	at org.apache.lucene.analysis.vi.VietnameseTokenizer.tokenize(VietnameseTokenizer.java:94) ~[?:?]
	at org.apache.lucene.analysis.vi.VietnameseTokenizer.reset(VietnameseTokenizer.java:142) ~[?:?]
	at org.apache.lucene.analysis.TokenFilter.reset(TokenFilter.java:70) ~[lucene-core-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39]
	at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:742) ~[lucene-core-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39]
	at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:447) ~[lucene-core-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39]
	at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:403) ~[lucene-core-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39]
	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:232) ~[lucene-core-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39]
	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:478) ~[lucene-core-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39]
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1571) ~[lucene-core-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39]
	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1316) ~[lucene-core-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39]
	at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:662) ~[elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.index.engine.InternalEngine.indexIntoLucene(InternalEngine.java:606) ~[elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:504) ~[elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:557) ~[elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:546) ~[elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.executeIndexRequestOnPrimary(TransportShardBulkAction.java:492) ~[elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:146) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:115) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:70) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:975) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:944) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:113) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onResponse(TransportReplicationAction.java:345) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onResponse(TransportReplicationAction.java:270) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$1.onResponse(TransportReplicationAction.java:924) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$1.onResponse(TransportReplicationAction.java:921) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.index.shard.IndexShardOperationsLock.acquire(IndexShardOperationsLock.java:151) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationLock(IndexShard.java:1659) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.action.support.replication.TransportReplicationAction.acquirePrimaryShardReference(TransportReplicationAction.java:933) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.action.support.replication.TransportReplicationAction.access$500(TransportReplicationAction.java:92) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun(TransportReplicationAction.java:291) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:266) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:248) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:654) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.6.5.jar:5.6.5]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]

Can't apply elasticsearch-analysis-vietnamese

Hello Duy Do,
I'm a student researching the Elasticsearch search engine. I tried your solution for Vietnamese stop words but I can't get it to work. Here is an example.

  1. I tried the example from Elasticsearch:
    PUT /index3
    {
      "settings": {
        "number_of_shards": 1,
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "type": "standard",
              "stopwords": ["bị", "và"]
            }
          }
        }
      }
    }

then
GET /index3/_analyze?analyzer=my_analyzer&text="anh bị đi bộ đội và phải nghỉ học"

result:
{

"tokens": [
    {
        "token": "anh",
        "start_offset": 1,
        "end_offset": 4,
        "type": "<ALPHANUM>",
        "position": 1
    }
    ,
    {
        "token": "đi",
        "start_offset": 8,
        "end_offset": 10,
        "type": "<ALPHANUM>",
        "position": 3
    }
    ,
    {
        "token": "bộ",
        "start_offset": 11,
        "end_offset": 13,
        "type": "<ALPHANUM>",
        "position": 4
    }
    ,
    {
        "token": "đội",
        "start_offset": 14,
        "end_offset": 17,
        "type": "<ALPHANUM>",
        "position": 5
    }
    ,
    {
        "token": "phải",
        "start_offset": 21,
        "end_offset": 25,
        "type": "<ALPHANUM>",
        "position": 7
    }
    ,
    {
        "token": "nghỉ",
        "start_offset": 26,
        "end_offset": 30,
        "type": "<ALPHANUM>",
        "position": 8
    }
    ,
    {
        "token": "học",
        "start_offset": 31,
        "end_offset": 34,
        "type": "<ALPHANUM>",
        "position": 9
    }
]

}

This is correct.

But then I tried with your example:
PUT index4
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "vi_tokenizer"
        }
      }
    }
  }
}

and GET /index4/_analyze?analyzer=my_analyzer&text="anh bị đi bộ đội và phải nghỉ học"
result:
{

"tokens": [
    {
        "token": "anh",
        "start_offset": 1,
        "end_offset": 4,
        "type": "word",
        "position": 2
    }
    ,
    {
        "token": "bị",
        "start_offset": 4,
        "end_offset": 6,
        "type": "word",
        "position": 4
    }
    ,
    {
        "token": "đi",
        "start_offset": 6,
        "end_offset": 8,
        "type": "word",
        "position": 6
    }
    ,
    {
        "token": "bộ đội",
        "start_offset": 8,
        "end_offset": 14,
        "type": "word",
        "position": 8
    }
    ,
    {
        "token": "và",
        "start_offset": 14,
        "end_offset": 16,
        "type": "word",
        "position": 10
    }
    ,
    {
        "token": "phải",
        "start_offset": 16,
        "end_offset": 20,
        "type": "word",
        "position": 12
    }
    ,
    {
        "token": "nghỉ",
        "start_offset": 20,
        "end_offset": 24,
        "type": "word",
        "position": 14
    }
    ,
    {
        "token": "học",
        "start_offset": 24,
        "end_offset": 27,
        "type": "word",
        "position": 16
    }
]

}
I don't understand why this does not work like the stop words example from Elasticsearch.
Can you explain it to me?

Thanks very much!

Error when indexing a database using analysis-vietnamese

First, thanks for your plugin.
But I have some errors when indexing a database with it.
I use Elasticsearch 1.7.0, and the plugin installed successfully, because this test succeeds:
curl -XGET 'http://localhost:9200/_analyze?analyzer=vi_analyzer&text=Công nghệ thông tin Việt Nam'

I tried to set the analyzer as the default when creating a new index, with this setting in elasticsearch.yml:

############################## Index

index.analysis.analyzer.default.type : "vi_analyzer"

But when I try to index the database, this error happens:

[2015-11-03 17:21:28,194][DEBUG][action.bulk ] [Jarella] [users][4] failed to execute bulk item (index) index {[users][user][72], source[{"id":72,"user_name":"chùa Thầy Việt Nam Hà Tây","password":"Tây Hà","sex":"male","photo_id":1,"@Version":"1","@timestamp":"2015-11-03T10:21:26.234Z"}]}
java.lang.NullPointerException
at org.apache.lucene.analysis.vi.VietnameseTokenizer.<init>(VietnameseTokenizer.java:74)
at org.apache.lucene.analysis.vi.VietnameseTokenizer.<init>(VietnameseTokenizer.java:62)
at org.apache.lucene.analysis.vi.VietnameseAnalyzer.createComponents(VietnameseAnalyzer.java:83)
at org.apache.lucene.analysis.AnalyzerWrapper.createComponents(AnalyzerWrapper.java:113)
at org.apache.lucene.analysis.AnalyzerWrapper.createComponents(AnalyzerWrapper.java:113)
at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:182)
at org.apache.lucene.document.Field.tokenStream(Field.java:554)
at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:611)
at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:359)
at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:318)
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:241)
at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:465)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1526)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1252)
at org.elasticsearch.index.engine.InternalEngine.innerIndex(InternalEngine.java:432)
at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:364)
at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:511)
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:413)
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:148)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase.performOnPrimary(TransportShardReplicationOperationAction.java:574)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase$1.doRun(TransportShardReplicationOperationAction.java:440)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

But if I try:
$ curl -XPUT 'http://localhost:9200/users/user/72' -d '{
"user_name":"chùa Thầy Việt Nam Hà Tây",
"password":"Tây Hà",
"sex":"male",
"photo_id":1,
"@Version":"1",
"@timestamp":"2015-11-03T10:21:26.234Z"
}'

everything is OK! The mapping properties are still "type" : "string" and "analyzer" : "vi_analyzer" (hidden because it is the default, but it can be tested).
I tried installing the plugin version 1.6 with Elasticsearch 1.6 and the issue still happens.

Thanks for your help!

Capitalization plus non-Vietnamese characters leads to inconsistent tokenization

I see that the tokenizer splits on certain non-Vietnamese Latin characters, such as å, ä, č, ç, ë, ğ, ï, ı, ñ, ö, ö, ø, š, ş, ü, and ÿ (which is not a complete list, I expect, just the ones I tested). This is not ideal for text that contains foreign words, but it is also not uncommon for analyzers to split on foreign characters.

I also see that adjacent capitalized words are tokenized together, including names. This is also not always ideal in cases like International Plant Names Index, but it is understandable.

However, these two interact unexpectedly; if the second letter of a capitalized word is a foreign character, it doesn't split. For example, the Spanish word año (which means "year") is tokenized as a + ñ + o, like this:

{
  "tokens" : [
    {
      "token" : "a",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "phrase",
      "position" : 0
    },
    {
      "token" : "ñ",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "residual",
      "position" : 1
    },
    {
      "token" : "o",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "phrase",
      "position" : 2
    }
  ]
}

However, the capitalized form, Año is tokenized as añ + o, like this:

{
  "tokens" : [
    {
      "token" : "añ",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "allcaps",
      "position" : 0
    },
    {
      "token" : "o",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "phrase",
      "position" : 1
    }
  ]
}

It seems that capitalization should not change the tokenization of a single word.

I'm also seeing similar behavior with other splitting characters: r=22 => r + 22, but R=22 => r= + 22. Similarly, x&y vs X&Y, x*y vs X*Y, x+y vs X+Y, and others.

Error when searching

ES 5.6.5

Settings

 "number_of_shards": "1",
        "provided_name": "lalafood_2018_11_16_1_shard",
        "creation_date": "1542330255235",
        "analysis": {
          "analyzer": {
            "lala_analyzer": {
              "filter": [
                "icu_folding"
              ],
              "char_filter": [
                "html_strip"
              ],
              "tokenizer": "vi_tokenizer"
            }
          }
        },
        "number_of_replicas": "1",

Error log:

[2018-11-16T01:22:54,863][WARN ][r.suppressed             ] path: /lalafood_latest/item/464077, params: {index=lalafood_latest, id=464077, type=item, timeout=15m}
java.lang.StringIndexOutOfBoundsException: String index out of range: 12
	at java.lang.String.substring(String.java:1963) ~[?:1.8.0_181]
	at vn.hus.nlp.tokenizer.Tokenizer.getNextToken(Tokenizer.java:469) ~[?:?]
	at vn.hus.nlp.tokenizer.Tokenizer.tokenize(Tokenizer.java:214) ~[?:?]
	at org.apache.lucene.analysis.vi.VietnameseTokenizer.tokenize(VietnameseTokenizer.java:94) ~[?:?]
	at org.apache.lucene.analysis.vi.VietnameseTokenizer.reset(VietnameseTokenizer.java:142) ~[?:?]
	at org.apache.lucene.analysis.TokenFilter.reset(TokenFilter.java:70) ~[lucene-core-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39]
	at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:742) ~[lucene-core-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39]
	at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:447) ~[lucene-core-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39]
	at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:403) ~[lucene-core-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39]
	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:232) ~[lucene-core-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39]
	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:478) ~[lucene-core-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39]
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1571) ~[lucene-core-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39]
	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1316) ~[lucene-core-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39]
	at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:662) ~[elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.index.engine.InternalEngine.indexIntoLucene(InternalEngine.java:606) ~[elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:504) ~[elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:557) ~[elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:546) ~[elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.executeIndexRequestOnPrimary(TransportShardBulkAction.java:492) ~[elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:146) ~[elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:115) ~[elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:70) ~[elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:975) ~[elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:944) ~[elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:113) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onResponse(TransportReplicationAction.java:345) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onResponse(TransportReplicationAction.java:270) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$1.onResponse(TransportReplicationAction.java:924) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$1.onResponse(TransportReplicationAction.java:921) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.index.shard.IndexShardOperationsLock.acquire(IndexShardOperationsLock.java:151) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationLock(IndexShard.java:1659) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.action.support.replication.TransportReplicationAction.acquirePrimaryShardReference(TransportReplicationAction.java:933) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.action.support.replication.TransportReplicationAction.access$500(TransportReplicationAction.java:92) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun(TransportReplicationAction.java:291) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:266) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:248) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:654) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.6.5.jar:5.6.5]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]


Question about checking whether a record already exists

Hello everyone,

I'm new to researching ES. I'm using the latest version currently available.

I use elasticsearch-php to index and get data.

I want to check whether a record exists before inserting it, e.g. insert it if it doesn't exist yet, skip it if it already does.

I hope someone can help.

NPE

Hi duydo,

I'm studying Vietnamese tokenization. I installed plugin version 0.1.1 with the latest ES. When using vi_analyzer with the sample suggester at http://www.elastic.co/guide/en/elasticsearch/reference/1.3/search-suggesters-phrase.html, I got 2 NPEs (the tokenizer provider cannot get an instance). Here they are:

java.lang.NullPointerException
    at vn.hus.nlp.tokenizer.Tokenizer.tokenize(Tokenizer.java:251)
    at org.apache.lucene.analysis.vi.VietnameseTokenizer.tokenize(VietnameseTokenizer.java:83)
    at org.apache.lucene.analysis.vi.VietnameseTokenizer.reset(VietnameseTokenizer.java:141)
    at org.apache.lucene.analysis.TokenFilter.reset(TokenFilter.java:70)
    at org.elasticsearch.search.suggest.SuggestUtils.analyze(SuggestUtils.java:126)
    at org.elasticsearch.search.suggest.phrase.NoisyChannelSpellChecker.getCorrections(NoisyChannelSpellChecker.java:66)
    at org.elasticsearch.search.suggest.phrase.PhraseSuggester.innerExecute(PhraseSuggester.java:99)
    at org.elasticsearch.search.suggest.phrase.PhraseSuggester.innerExecute(PhraseSuggester.java:53)
    at org.elasticsearch.search.suggest.Suggester.execute(Suggester.java:42)
    at org.elasticsearch.search.suggest.SuggestPhase.execute(SuggestPhase.java:85)
    at org.elasticsearch.search.suggest.SuggestPhase.execute(SuggestPhase.java:74)
    at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:170)
    at org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:286)
    at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:297)
    at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:231)
    at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:228)
    at org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:559)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

and the second NPE:

java.lang.NullPointerException
    at org.apache.lucene.analysis.vi.VietnameseTokenizer.<init>(VietnameseTokenizer.java:74)
    at org.apache.lucene.analysis.vi.VietnameseTokenizer.<init>(VietnameseTokenizer.java:62)
    at org.elasticsearch.indices.analysis.VietnameseIndicesAnalysis$1.create(VietnameseIndicesAnalysis.java:53)
    at org.elasticsearch.index.analysis.CustomAnalyzer.createComponents(CustomAnalyzer.java:83)
    at org.apache.lucene.analysis.AnalyzerWrapper.createComponents(AnalyzerWrapper.java:113)
    at org.apache.lucene.analysis.AnalyzerWrapper.createComponents(AnalyzerWrapper.java:113)
    at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:144)
    at org.elasticsearch.search.suggest.phrase.NoisyChannelSpellChecker.tokenStream(NoisyChannelSpellChecker.java:143)
    at org.elasticsearch.search.suggest.phrase.PhraseSuggester.innerExecute(PhraseSuggester.java:96)
    at org.elasticsearch.search.suggest.phrase.PhraseSuggester.innerExecute(PhraseSuggester.java:53)
    at org.elasticsearch.search.suggest.Suggester.execute(Suggester.java:42)
    at org.elasticsearch.search.suggest.SuggestPhase.execute(SuggestPhase.java:85)
    at org.elasticsearch.search.suggest.SuggestPhase.execute(SuggestPhase.java:74)
    at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:170)
    at org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:286)
    at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:297)
    at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:231)
    at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:228)
    at org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:559)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Many patterns need better word boundaries

  1. Slashes: In general slashes and dashes are treated unexpectedly based on how date-like or fraction-like they look in inappropriate contexts. Parsing of dates is fine, but not everything that looks like a date is, and MM/DD/YYYY format is not parsed. Parsing fractions is fine, too, but both dates and fractions should have proper word boundaries around them:
15/11/1910 => 15/11/1910
11/15/1910 => 11/15 + 1910
10/10/10 => 10/10/10
10/30/10 => 10/30 + 10
10/10/10/10/10/10 => 10/10/10 + 10/10/10
10/10/10/10/40/10 => 10/10/10 + 10/40 + 10
  2. Dashes: Again, parsing dates is fine, though not everything that looks like a date is, and that causes inconsistencies. MM-DD-YYYY is not matched. Better word boundary detection would make more sense.
15-11-1910 => 15-11-1910
11-15-1910 => 11-15 + -1910
10-10-10 => 10-10-10
10-30-10 => 10-30 + -10
10-10-10-10-10-10 => 10-10-10 + -10 + -10 + -10
10-40-10-10-10-10 => 10-40 + -10 + -10 + -10 + -10
10-000-10-10-10-10 => 10-00 + 0 + -10 + -10 + -10 + -10

0-915826-22-4 => 0 + -915826 + -22 + -4
2-915826-22-4 => 2-9158 + 26-2 + 2-4

x-1y => x -1 y
  3. URL domain matching is too aggressive: Domain matching doesn't limit itself to word boundaries, and so can break up tokens oddly, especially when there are accented characters:
Daily.ngày -> Daily.ng + ày
aaa.eee.iii -> aaa.eee.iii
aáa.eee.iii -> aáa + eee.iii
aaa.eée.iii -> aaa.e + ée + iii
aaa.eee.iíi -> aaa.eee.i + íi

Support Elasticsearch 5.6.x?

Hi, this is a great plugin. I downloaded it for Elasticsearch 5.6.0 but got the following issue:

Exception in thread "main" java.lang.IllegalArgumentException: plugin [elasticsearch-analysis-vietnamese] is incompatible with version [5.6.0]; was designed for version [5.4.1]

Error when running mvn package on Windows

I ran the commands up to:
cd elasticsearch-analysis-vietnamese
mvn package
and then got the following error:
[ERROR] Failed to execute goal com.carrotsearch.randomizedtesting:junit4-maven-plugin:2.3.3:junit4 (unit-tests) on project elasticsearch-analysis-vietnamese: There were test failures: 2 suites, 1 test, 1 suite-level error, 1 error [seed: 2073BF2290744A13]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

If I skip this step, I still cannot get the plugin integrated with either of the following two approaches:
Approach 1: continue in cmd:
C:\elasticsearch-5.4.1\bin\elasticsearch-plugin install C:\elasticsearch-analysis-vietnamese-5.4.1.zip
The .zip was downloaded from https://github.com/duydo/elasticsearch-analysis-vietnamese/releases (it contains many Java files).
This gives the error: ERROR: Unknown plugin C:\elasticsearch-analysis-vietnamese-5.4.1.zip
Approach 2: extract elasticsearch-analysis-vietnamese-5.4.1.zip (which contains many Java files) into an elasticsearch folder. I copied this folder, renamed it to "elasticsearch-analysis-vietnamese", and pasted it into the plugins folder "C:\elasticsearch-5.4.1\plugins", but it still cannot be used.

Cannot Index Document on 2.2.0

Hi,
I installed version 2.2.0 of the plugin into elasticsearch 2.2.0 and installation (and also index creation) worked fine. For this index I set the default analyzer like this:

"default": {
  "type": "vi_analyzer"
}

Now, when I try to index a simple document like this:

PUT /vi_index/system/1
{
  "test": "15"
}

I get error:

{
    "error" : {
        "root_cause" : [{
                "type" : "remote_transport_exception",
                "reason" : "[Kasper Cole][127.0.0.1:9300][indices:data/write/index[p]]"
            }
        ],
        "type" : "null_pointer_exception",
        "reason" : null
    },
    "status" : 500
}

and when I run the same functionality from Java, I get a more detailed error:

java.lang.NullPointerException
    at org.apache.lucene.analysis.vi.VietnameseTokenizer.<init>(VietnameseTokenizer.java:74)
    at org.apache.lucene.analysis.vi.VietnameseTokenizer.<init>(VietnameseTokenizer.java:62)
    at org.apache.lucene.analysis.vi.VietnameseAnalyzer.createComponents(VietnameseAnalyzer.java:83)
    at org.apache.lucene.analysis.AnalyzerWrapper.createComponents(AnalyzerWrapper.java:101)
    at org.apache.lucene.analysis.AnalyzerWrapper.createComponents(AnalyzerWrapper.java:101)
    at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:176)
    at org.apache.lucene.document.Field.tokenStream(Field.java:562)
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:607)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:234)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:450)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1477)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1256)
    at org.elasticsearch.index.engine.InternalEngine.innerIndex(InternalEngine.java:539)
    at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:468)
    at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:567)
    at org.elasticsearch.index.engine.Engine$Index.execute(Engine.java:836)
    at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:236)
    at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:157)
    at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:65)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:595)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:263)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:260)
    at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:350)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:744)

Can you please check this issue?
Thanks!

Doesn't work when creating an index with a mapping

When I create a new index, specifying that the "content" field uses the vi_analyzer analyzer:

PUT http://localhost:9200/test/
{
  "mappings": {
    "data": {
      "properties": {
        "content": {
          "type": "string",
          "analyzer": "vi_analyzer"
        }
      }
    }
  }
}

Then I index the following document:

POST http://localhost:9200/test/data/
{
  "content": "Công nghệ thông tin Việt Nam",
  "id": 100
}

However, when I search:

POST http://localhost:9200/test/data/_search/
{
  "query": {
    "match": {
      "content": "Công nghệ thông tin"
    }
  }
}

I received this result:
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 4,
"failed": 1,
"failures": [
{
"index": "test",
"shard": 0,
"status": 400,
"reason": "SearchParseException[[test][0]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query":{"match":{"content":"Công nghệ thông tin"}}}]]]; nested: NullPointerException; "
}
]
},
"hits": {
"total": 0,
"max_score": null,
"hits": [ ]
}
}

In the Elasticsearch console, I received an exception:

org.elasticsearch.search.SearchParseException: [test][0]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query":{"match":{"content":"Công nghệ thông tin"}}}]]
at org.elasticsearch.search.SearchService.parseSource(SearchService.java:660)
at org.elasticsearch.search.SearchService.createContext(SearchService.java:516)
at org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:488)
at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:257)
at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:206)
at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:203)
at org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:517)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at java.util.regex.Matcher.getTextLength(Matcher.java:1234)
at java.util.regex.Matcher.reset(Matcher.java:308)
at java.util.regex.Matcher.<init>(Matcher.java:228)
at java.util.regex.Pattern.matcher(Pattern.java:1088)
at vn.hus.nlp.tokenizer.Tokenizer.getNextToken(Tokenizer.java:438)
at vn.hus.nlp.tokenizer.Tokenizer.tokenize(Tokenizer.java:214)
at org.apache.lucene.analysis.vi.VietnameseTokenizer.tokenize(VietnameseTokenizer.java:83)
at org.apache.lucene.analysis.vi.VietnameseTokenizer.reset(VietnameseTokenizer.java:141)
at org.apache.lucene.analysis.TokenFilter.reset(TokenFilter.java:70)
at org.apache.lucene.analysis.TokenFilter.reset(TokenFilter.java:70)
at org.apache.lucene.analysis.util.FilteringTokenFilter.reset(FilteringTokenFilter.java:111)
at org.apache.lucene.util.QueryBuilder.createFieldQuery(QueryBuilder.java:208)
at org.apache.lucene.util.QueryBuilder.createBooleanQuery(QueryBuilder.java:87)
at org.elasticsearch.index.search.MatchQuery.parse(MatchQuery.java:210)
at org.elasticsearch.index.query.MatchQueryParser.parse(MatchQueryParser.java:165)
at org.elasticsearch.index.query.QueryParseContext.parseInnerQuery(QueryParseContext.java:239)
at org.elasticsearch.index.query.IndexQueryParserService.innerParse(IndexQueryParserService.java:342)
at org.elasticsearch.index.query.IndexQueryParserService.parse(IndexQueryParserService.java:268)
at org.elasticsearch.index.query.IndexQueryParserService.parse(IndexQueryParserService.java:263)
at org.elasticsearch.search.query.QueryParseElement.parse(QueryParseElement.java:33)
at org.elasticsearch.search.SearchService.parseSource(SearchService.java:644)
... 9 more

Am I missing something? My Elasticsearch version is 1.3.4. (I've tried GET /_analyze?analyzer=standard&text="Công nghệ thông tin Việt Nam"; it worked correctly, but it does not work when used in the mapping.) Sorry for my bad English.

More capitalization inconsistencies

[I've got a number of issues, which I'm going to file separately, based on grouping related symptoms, but some of them may be related. I think a number of them may be related to matching patterns without good boundary detection.]

Below are some more examples (as in issue #26) of strings that differ only by case but are tokenized differently.

a1b2c3 => a 1 b 2 c 3
1a2b3c => 1 a 2 b 3 c

A1B2C3 => a1 b2 c3
1A2B3C => 1 a2 b3 c

aa1bbb24cccc369 => aa 1 bbb 24 cccc 369
AA1BBB24CCCC369 => AA1 BBB2 4 CCCC3 69

a_b => a b
A_b => A_ b
a_ => a
A_ => a_

1000x1000 => 1000 x 1000
1000X1000 => 1000 x1 000

X.Jones => x.j ones
X.jones => x jones
x.Jones => x jones
x.jones => x jones

X.Y.Jones => x.y.j ones
X.Y.jones => x.y jones
X.y.jones => x y jones

XX.YY.JJones => xx.yy.jjones

Exception when querying data

Hello, thank you very much for this useful library.

I'm currently testing your library, but every time I query my test ES nodes an exception is thrown. The error appears quite randomly on each node, even though the query is the same. You can see the images I attached. This error may be why the results returned for each query are also not the same.

I'm testing with ES 2.4.4, Lucene 5.5.2, OpenJDK 8 and Debian 8, with about 200 thousand address records. If you have time and need to reproduce the error, you can contact me by mail at khoimt47ATgmail.com to get the sample data. I'd also like to debug it myself but don't know where to start.

Thank you.

http://imgur.com/a/oJZ9r
http://imgur.com/a/xJE47

Elasticsearch error log:

[DEBUG][action.search ] [Chase Stein] [address-mapping-de][2], node[Okfd0cdjTGasSGietUUG8g], [P], v[2], s[STARTED], a[id=vqeDdbV-SqmPJlUfa1r9wQ]: Failed to execute [org.elasticsearch.action.search.SearchRequest@51525c16] lastShard [true]
RemoteTransportException[[Chase Stein][10.0.2.15:9300][indices:data/read/search[phase/query]]]; nested: SearchParseException[failed to parse search source [{
"query": {
"query_string": {
// "fields" : ["pn", "dn", "pn.folded", "dn.folded"],
"query": "12 2A đường 18 pid:126 did:1074",
"analyzer": "vi_analyzer"
}
}
}
]]; nested: NullPointerException;
Caused by: SearchParseException[failed to parse search source [{
"query": {

  "query_string": {
     // "fields" : ["pn", "dn", "pn.folded", "dn.folded"],
     "query": "12 2A đường 18 pid:126 did:1074",
     "analyzer": "vi_analyzer"
  }

}
}
]]; nested: NullPointerException;
at org.elasticsearch.search.SearchService.parseSource(SearchService.java:873)
at org.elasticsearch.search.SearchService.createContext(SearchService.java:667)
at org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:633)
at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:377)
at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:368)
at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:365)
at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:77)
at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:378)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

MapperParsingException

Hi,
I am trying to use the plugin; however, I did not manage to make it work.

After installation from binary source it seems to correctly load:

[2014-12-16 11:17:05,552][INFO ][node                     ] [elasticsearch] version[1.3.4], pid[3794], build[a70f3cc/2014-09-30T09:07:17Z]
[2014-12-16 11:17:05,553][INFO ][node                     ] [elasticsearch] initializing ...
[2014-12-16 11:17:05,577][INFO ][plugins                  ] [elasticsearch] loaded [marvel, jdbc-1.3.4.4-d2e33c3, analysis-vietnamese], sites [marvel]

However, when I try to set a custom analyzer in the mapping of a new index as:

"keyword_account": {"fields": {"raw": {"index": "not_analyzed", "type": "string"}, 
"en": {"type": "string", "analyzer": "my_english"}, 
"th": {"type": "string", "analyzer": "thai"}, 
"splitted": {"type": "string", "analyzer": "my_analyzer"}, 
"id": {"type": "string", "analyzer": "indonesian"}, 
"vn": {"type": "string", "analyzer": "vi_analyzer"}
}, 
"type": "string"}

I get the following response:

 u'MapperParsingException[Analyzer [vi_analyzer] not found for field [vn]]

any clue?

Highlighting Issue (v. 2.2.0)

Thank you for the plugin, but I have an issue with highlighting of matched terms. I use version 2.2.0.
When I run the following query:

{
  "query": {
    "bool": {
      "should": {
        "multi_match": {
          "query": "chữ",
          "fields": [
            "message",
            "user"
          ],
          "analyzer": "default_search"
        }
      }
    }
  },
  "highlight": {
    "pre_tags": [
      "<b>"
    ],
    "post_tags": [
      "</b>"
    ],
    "fragment_size": 0,
    "number_of_fragments": 0,
    "require_field_match": false,
    "fields": {
      "message": {},
      "user": {}
    }
  }
}

I get the following result, where the highlighting offsets are wrong and they wrap the wrong words:

{
    "took" : 14,
    "timed_out" : false,
    "_shards" : {
        "total" : 200,
        "successful" : 200,
        "failed" : 0
    },
    "hits" : {
        "total" : 1,
        "max_score" : 0.030578919,
        "hits" : [{
                "_index" : "fts-vietnamese",
                "_type" : "Document",
                "_id" : "AVKcb6Xy0-uCokJzleqC",
                "_score" : 0.030578919,
                "_source" : {
                    "streamId" : 1,
                    "language" : "vietnamese",
                    "message" : "Có một vấn đề là khi sent text messages dùng tiếng Việt hoặc email qua người khác, chữ tiếng Việt bị mất dấu hoặc mất chữ. Chẳng hạn như chữ “ôm” thì thành",
                    "doc_id" : "VietnameseWords"
                },
                "highlight" : {
                    "message" : [
                        "Có một vấn đề là khi sent text messages dùng tiếng Việt hoặc email <b>qua</b> người khác, chữ tiếng V<b>iệt</b> bị mất dấu h<b>oặc</b> mất chữ. Chẳng hạn như chữ “ôm” thì thành"
                    ]
                }
            }
        ]
    }
}

Can you please advise whether this is a bug that you can fix or something that I can configure in my code?
Thank you!

Settings sample for filter stopwords

I've just written these settings but I'm not sure about them.

analysis: {
               analyzer: {
                   email: {
                       tokenizer: 'keyword',
                       filter: ['lowercase']
                   },
                   vietnamese: {
                       tokenizer: 'vi_analyzer',
                       filter: ['lowercase', 'vi_stop']
                   }
               }
           } 

Could you correct me if I'm wrong?

Unexpected whitespace causes errors

I've found three cases where the unexpected presence or absence of whitespace causes offset error or string index out of range error. I'm reporting all three together, since I'm guessing they are related.

  1. Two newlines in a row causes an error:
curl -s localhost:9200/wiki_content/_analyze?pretty -d '{"analyzer": "vi", "text" : "x\n\ny" }'

{
  "error" : {
    "root_cause" : [
      {
        "type" : "remote_transport_exception",
        "reason" : "[K5DTwrD][127.0.0.1:9300][indices:admin/analyze[s]]"
      }
    ],
    "type" : "string_index_out_of_bounds_exception",
    "reason" : "String index out of range: -1"
  },
  "status" : 500
}
  2. Two spaces between elements that should tokenize together causes an error. In this case "không gian" is normally indexed as one token. But if it has two spaces between "không" and "gian" it causes an error:
curl -s localhost:9200/wiki_content/_analyze?pretty -d '{"analyzer": "vi", "text" : "không  gian"}'

{
  "error" : {
    "root_cause" : [
      {
        "type" : "remote_transport_exception",
        "reason" : "[K5DTwrD][127.0.0.1:9300][indices:admin/analyze[s]]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "startOffset must be non-negative, and endOffset must be >= startOffset, startOffset=-1,endOffset=9"
  },
  "status" : 400
}
  3. No space between elements that should tokenize together causes an error. In this case, "năm 6" usually gets tokenized together, but if there's no space in there, I think it still gets split into two tokens, but the lack of space between causes an error:
curl -s localhost:9200/wiki_content/_analyze?pretty -d '{"analyzer": "vi", "text" : "năm6"}'

{
  "error" : {
    "root_cause" : [
      {
        "type" : "remote_transport_exception",
        "reason" : "[K5DTwrD][127.0.0.1:9300][indices:admin/analyze[s]]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "startOffset must be non-negative, and endOffset must be >= startOffset, startOffset=-1,endOffset=4"
  },
  "status" : 400
}
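
A possible stopgap while the underlying bug is investigated (an untested sketch; the index name is a placeholder, and a pattern_replace character filter can itself shift offsets) is to collapse runs of whitespace into a single space before the text reaches vi_tokenizer:

PUT wiki_content_ws_test
{
  "settings": {
    "analysis": {
      "char_filter": {
        "collapse_whitespace": {
          "type": "pattern_replace",
          "pattern": "\\s+",
          "replacement": " "
        }
      },
      "analyzer": {
        "vi_collapsed": {
          "tokenizer": "vi_tokenizer",
          "char_filter": ["collapse_whitespace"],
          "filter": ["lowercase"]
        }
      }
    }
  }
}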

analyzer [vi_analyzer] not found for field [content]

Hi duydo,
I have problem with this plugin.
I am using:
elasticsearch-analysis-vietnamese-2.4.0.zip
and
elasticsearch_2.4

FOR VI_TOKENIZER
It works fine when used for an analyze request:
#####################
POST _analyze
{
  "tokenizer": "vi_tokenizer",
  "text": "giá cả thế nào hả bạn"
}
RESULT
{
  "tokens": [
    { "token": "giá cả", "start_offset": 0, "end_offset": 6, "type": "word", "position": 0 },
    { "token": "thế nào", "start_offset": 7, "end_offset": 14, "type": "word", "position": 1 },
    { "token": "hả", "start_offset": 15, "end_offset": 17, "type": "word", "position": 2 },
    { "token": "bạn", "start_offset": 18, "end_offset": 21, "type": "word", "position": 3 }
  ]
}
#######################
But when I use it to build a custom analyzer, it CAN'T find vi_tokenizer:

PUT resource_message
{
  "settings": {
    "analysis": {
      "analyzer": {
        "vi_customize_analyzer": {
          "type": "custom",
          "tokenizer": "vi_tokenizer"
        }
      }
    }
  }
}
RESULT IS:
{
  "error": {
    "root_cause": [
      {
        "type": "remote_transport_exception",
        "reason": "[Air-Walker][10.240.0.2:9300][indices:admin/create]"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "Custom Analyzer [vi_customize_analyzer] failed to find tokenizer under name [vi_tokenizer]"
  },
  "status": 400
}
######################################

FOR VI_ANALYZER
I continued with vi_analyzer. It works fine for an analyze request:
POST _analyze
{
  "analyzer": "vi_analyzer",
  "text": "giá cả thế nào hả bạn"
}
RESULT
{
  "tokens": [
    { "token": "giá cả", "start_offset": 0, "end_offset": 6, "type": "word", "position": 0 },
    { "token": "thế nào", "start_offset": 7, "end_offset": 14, "type": "word", "position": 1 },
    { "token": "hả", "start_offset": 15, "end_offset": 17, "type": "word", "position": 2 },
    { "token": "bạn", "start_offset": 18, "end_offset": 21, "type": "word", "position": 3 }
  ]
}

But when I use it in a mapping, it fails:
PUT resource_message
{
  "mappings": {
    "data": {
      "properties": {
        "content": {
          "type": "string",
          "analyzer": "vi_analyzer"
        }
      }
    }
  }
}
RESULT
{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "analyzer [vi_analyzer] not found for field [content]"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "Failed to parse mapping [data]: analyzer [vi_analyzer] not found for field [content]",
    "caused_by": {
      "type": "mapper_parsing_exception",
      "reason": "analyzer [vi_analyzer] not found for field [content]"
    }
  },
  "status": 400
}

I don't know which step I got wrong. Can you help me? :(
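
For reference, a combined settings-plus-mappings request along these lines is what would normally be expected to work once the plugin is available on every node (an ES 2.x sketch reusing the index and type names from above; not a confirmed fix):

PUT resource_message
{
  "settings": {
    "analysis": {
      "analyzer": {
        "vi_customize_analyzer": {
          "type": "custom",
          "tokenizer": "vi_tokenizer"
        }
      }
    }
  },
  "mappings": {
    "data": {
      "properties": {
        "content": {
          "type": "string",
          "analyzer": "vi_customize_analyzer"
        }
      }
    }
  }
}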

Error when searching with Elasticsearch

After building the index and inserting data, searching still returns results, but the log shows this error:

java.security.AccessControlException: access denied ("java.io.FilePermission" "tokenizer.log.lck" "write")

at java.security.AccessControlContext.checkPermission(AccessControlContext.java:472)
at java.security.AccessController.checkPermission(AccessController.java:884)
at java.lang.SecurityManager.checkPermission(SecurityManager.java:549)
at java.lang.SecurityManager.checkWrite(SecurityManager.java:979)
at sun.nio.fs.UnixChannelFactory.open(UnixChannelFactory.java:247)
at sun.nio.fs.UnixChannelFactory.newFileChannel(UnixChannelFactory.java:136)
at sun.nio.fs.UnixChannelFactory.newFileChannel(UnixChannelFactory.java:148)
at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:175)
at java.nio.channels.FileChannel.open(FileChannel.java:287)
at java.nio.channels.FileChannel.open(FileChannel.java:335)
at java.util.logging.FileHandler.openFiles(FileHandler.java:459)
at java.util.logging.FileHandler.<init>(FileHandler.java:292)
at vn.hus.nlp.tokenizer.Tokenizer.createLogger(Tokenizer.java:167)
at vn.hus.nlp.tokenizer.Tokenizer.<init>(Tokenizer.java:134)
at vn.hus.nlp.tokenizer.TokenizerProvider.<init>(TokenizerProvider.java:58)
at vn.hus.nlp.tokenizer.TokenizerProvider.getInstance(TokenizerProvider.java:108)

Caused by: java.lang.NullPointerException
        at vn.hus.nlp.tokenizer.Tokenizer.tokenize(Tokenizer.java:251)
        at org.apache.lucene.analysis.vi.VietnameseTokenizer.tokenize(VietnameseTokenizer.java:88)
        at org.apache.lucene.analysis.vi.VietnameseTokenizer.reset(VietnameseTokenizer.java:141)
        at org.apache.lucene.analysis.CachingTokenFilter.reset(CachingTokenFilter.java:58)
        at org.apache.lucene.util.QueryBuilder.createFieldQuery(QueryBuilder.java:222)
        at org.apache.lucene.util.QueryBuilder.createBooleanQuery(QueryBuilder.java:87)
        at org.elasticsearch.index.search.MatchQuery.parse(MatchQuery.java:176)
        at org.elasticsearch.index.search.MultiMatchQuery.parseAndApply(MultiMatchQuery.java:54)
        at org.elasticsearch.index.search.MultiMatchQuery.access$000(MultiMatchQuery.java:41)
        at org.elasticsearch.index.search.MultiMatchQuery$QueryBuilder.parseGroup(MultiMatchQuery.java:120)
        at org.elasticsearch.index.search.MultiMatchQuery$QueryBuilder.buildGroupedQueries(MultiMatchQuery.java:111)
Do you know what might be causing this? Thank you!
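
Errors of this kind are usually dealt with by granting the missing permission, either in the plugin's plugin-security.policy (which means rebuilding the plugin) or in a JVM security policy. A sketch of the grant, with the lock-file path as a placeholder (untested against this plugin; whether granting it is the right fix here is a separate question):

grant {
  // Placeholder path: wherever the tokenizer's FileHandler writes its log file
  permission java.io.FilePermission "/path/to/tokenizer.log.lck", "write";
};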

Error when build plugin from source

Hi,
First of all, thanks for the nice plugin.
I have a problem: I built the plugin from your source, and when I run Elasticsearch I get this error:

{
  "error": "NoClassDefFoundError[vn/hus/nlp/sd/SentenceDetectorFactory]; nested: ClassNotFoundException[vn.hus.nlp.sd.SentenceDetectorFactory]; ",
  "status": 500
}

This is my source (screenshot: "screen shot 2015-10-21 at 11 36 54 pm").

Please help me,
Thank you!

Token offsets are incorrect

I had to build the plugin myself for Elasticsearch v5.3.2.

If I analyze the string "sách .. sách ; 東京都", I get the following:

{
  "tokens" : [
    {
      "token" : "sách",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "phrase",
      "position" : 0
    },
    {
      "token" : "sách",
      "start_offset" : 5,
      "end_offset" : 9,
      "type" : "phrase",
      "position" : 3
    },
    {
      "token" : "東",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "residual",
      "position" : 7
    },
    {
      "token" : "京",
      "start_offset" : 12,
      "end_offset" : 13,
      "type" : "residual",
      "position" : 11
    },
    {
      "token" : "都",
      "start_offset" : 14,
      "end_offset" : 15,
      "type" : "residual",
      "position" : 15
    }
  ]
}

The start and end offsets are incorrect and do not map back into the original string. Each start_offset is just one more than the end_offset of the previous token, which is wrong whenever there is more or less than exactly one character between tokens.

I believe that this would be the correct output:

{
  "tokens" : [
    {
      "token" : "sach",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "phrase",
      "position" : 0
    },
    {
      "token" : "sach",
      "start_offset" : 8,
      "end_offset" : 12,
      "type" : "phrase",
      "position" : 1
    },
    {
      "token" : "東",
      "start_offset" : 15,
      "end_offset" : 16,
      "type" : "residual",
      "position" : 2
    },
    {
      "token" : "京",
      "start_offset" : 16,
      "end_offset" : 17,
      "type" : "residual",
      "position" : 3
    },
    {
      "token" : "都",
      "start_offset" : 17,
      "end_offset" : 18,
      "type" : "residual",
      "position" : 4
    }
  ]
}

Note that the position values are also not correct.

plugin maven compile error

I'm following http://duydo.me/how-to-build-elasticsearch-vietnamese-analysis-plugin/
but it fails equally on macOS and Ubuntu - can you advise? Thanks

[INFO] Execution time total: 8.51 sec.
[INFO] Tests summary: 2 suites, 6 tests, 3 failures
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 10.978 s
[INFO] Finished at: 2017-08-24T15:06:47-07:00
[INFO] Final Memory: 23M/265M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal com.carrotsearch.randomizedtesting:junit4-maven-plugin:2.3.3:junit4 (unit-tests) on project elasticsearch-analysis-vietnamese: There were test failures: 2 suites, 6 tests, 3 failures [seed: 575B65543AFBCF2B] -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal com.carrotsearch.randomizedtesting:junit4-maven-plugin:2.3.3:junit4 (unit-tests) on project elasticsearch-analysis-vietnamese: There were test failures: 2 suites, 6 tests, 3 failures [seed: 575B65543AFBCF2B]
	at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212)
	at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
	at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
	at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
	at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
	at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
	at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
	at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
	at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
	at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
	at org.apache.maven.cli.MavenCli.execute(MavenCli.java:863)
	at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
	at org.apache.maven.cli.MavenCli.main(MavenCli.java:199)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
	at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
	at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
	at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
Caused by: org.apache.maven.plugin.MojoExecutionException: There were test failures: 2 suites, 6 tests, 3 failures [seed: 575B65543AFBCF2B]
	at com.carrotsearch.maven.plugins.junit4.JUnit4Mojo.execute(JUnit4Mojo.java:528)
	at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
	at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:207)
	... 20 more
[ERROR]
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

Problem when using bulk to update a field that uses vi_analyzer

Elasticsearch & plugin version: 2.4.1

I'm using bulk to switch a field from the standard analyzer to vi_analyzer (around 400,000 documents), and Elasticsearch keeps spitting out this error repeatedly:

java.lang.NullPointerException
        at vn.hus.nlp.tokenizer.Tokenizer.tokenize(Tokenizer.java:251)
        at org.apache.lucene.analysis.vi.VietnameseTokenizer.tokenize(VietnameseTokenizer.java:88)
        at org.apache.lucene.analysis.vi.VietnameseTokenizer.reset(VietnameseTokenizer.java:141)
        at org.apache.lucene.analysis.TokenFilter.reset(TokenFilter.java:70)
        at org.apache.lucene.analysis.TokenFilter.reset(TokenFilter.java:70)
        at org.apache.lucene.analysis.util.FilteringTokenFilter.reset(FilteringTokenFilter.java:67)
        at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:630)
        at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:365)
        at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:321)
        at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:234)
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:450)
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1477)
        at org.elasticsearch.index.engine.InternalEngine.innerIndex(InternalEngine.java:538)
        at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:454)
        at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:601)
        at org.elasticsearch.index.engine.Engine$Index.execute(Engine.java:836)
        at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnReplica(TransportIndexAction.java:195)
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:437)
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:68)
        at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.doRun(TransportReplicationAction.java:401)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
        at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:299)
        at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:291)
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:77)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:293)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Suggestion: better handle combining diacritics

Combining characters (including diacritics and other combining marks in non-Latin scripts) cause tokens to split. Some examples from various scripts (a possible mitigation is sketched after the list):

  • ضَمَّة — Arabic
  • বাংলা — Bengali
  • Михаи́л — Cyrillic + combining acute
  • दिल्ली — Devanagari
  • ಅಕ್ಷರಮಾಲೆ — Kannada
  • áa - Latin + combining acute (vs áa with precomposed character)
  • ଓଡ଼ିଆ — Oriya
  • தமிழ் — Tamil
  • తెలుగు — Telugu
  • อักษรไทย — Thai
  • g͡b — International Phonetic Alphabet
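
One possible mitigation, sketched below under the assumption that the official analysis-icu plugin is installed (the index name is a placeholder), is to fold combining marks into precomposed characters with an icu_normalizer character filter before the text reaches vi_tokenizer:

PUT my-vi-index-00003
{
  "settings": {
    "analysis": {
      "char_filter": {
        "nfc_normalizer": {
          "type": "icu_normalizer",
          "name": "nfc",
          "mode": "compose"
        }
      },
      "analyzer": {
        "my_vi_analyzer": {
          "tokenizer": "vi_tokenizer",
          "char_filter": ["nfc_normalizer"],
          "filter": ["lowercase"]
        }
      }
    }
  }
}

This only helps where a precomposed form exists (for example, Latin letters with a combining acute); scripts with no precomposed equivalents would still need handling in the tokenizer itself.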

Can it be used with elasticsearch 2.2?

If I try
bin/plugin install https://dl.dropboxusercontent.com/u/1598491/elasticsearch-analysis-vietnamese-0.2.1.1.zip
I get:
ERROR: Plugin [elasticsearch-analysis-vietnamese] is incompatible with Elasticsearch [2.2.0]. Was designed for version [2.1.1]

thanks!

Error Code 500 crash internally

The dataset includes a field with an empty string, which seems to make it crash internally.

logstash | [2017-09-28T13:13:39,840][INFO ][logstash.outputs.elasticsearch] Retrying individual bulk actions that failed or were rejected by the previous bulk request. {:count=>1}
logstash | [2017-09-28T13:13:43,799][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 500 ({"type"=>"string_index_out_of_bounds_exception", "reason"=>"String index out of range: 3"})

.tar.gz release file

Could you please provide a .tar.gz version of the binary in addition to the zip?

I am building a Docker image with an automated build and want to use the release URL directly. However, unzip is not available in the official container, and I don't really want to install and then uninstall it.
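
As a possible workaround (a sketch; the image tag and plugin URL are placeholders), elasticsearch-plugin install accepts an HTTP(S) URL to a zip directly, so unzip may not be needed inside the image at all:

FROM docker.elastic.co/elasticsearch/elasticsearch:5.6.16
# --batch auto-accepts the plugin's requested permissions during the build
RUN bin/elasticsearch-plugin install --batch https://example.com/elasticsearch-analysis-vietnamese.zip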

Very slow run time

I don't know if you can do anything about it, but in my tests, processing Vietnamese Wikipedia articles in 100-line batches, the analyzer is 30x to 40x slower than the default analyzer.

Processing 5,000 articles (just running the analysis, not full indexing) took ~0:17 with the default analyzer; vi_analyzer took ~8:05 on the same text. For comparison, the English analyzer on the same text also took ~0:17. These measurements were taken on a virtual machine on my laptop, so they aren't very precise, but I ran batches of several sizes (100, 1,000, and 5,000 articles) and the differences stayed in the same 30x-40x range, with smaller batches comparatively slower. Somewhere in the 3x-5x range for complex analysis might be bearable, but 30x may be too much.
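
A rough way to reproduce this kind of comparison (a sketch, not the exact methodology used above; the index name follows the other issues here and the text payload is elided):

# Compare analysis time for the same text under two analyzers
time curl -s localhost:9200/wiki_content/_analyze -d '{"analyzer": "vi", "text": "..."}' > /dev/null
time curl -s localhost:9200/wiki_content/_analyze -d '{"analyzer": "standard", "text": "..."}' > /dev/null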

Do you have any way to do profiling to see if there are any obvious speed ups?

[Last issue for today. Sorry for the onslaught of issues. I wanted to share all the stuff I found, because I'd like to try this plugin and see how our users like it! Thanks for all your work on the plugin!]

Contextual inconsistencies

Seemingly unrelated characters around a token can sometimes change how it is parsed in unexpected ways.

  1. Dash-connected-words are split differently depending on other words and punctuation:
xxx-yyy-zzz => xxx-yyy-zzz
w xxx-yyy-zzz => w + xxx + yyy-zzz

but:

. xxx-yyy-zzz => xxx-yyy-zzz
w. xxx-yyy-zzz => w + xxx-yyy-zzz
w, xxx-yyy-zzz => w + xxx-yyy-zzz
w- xxx-yyy-zzz => w + xxx-yyy-zzz

Any of these characters can come after the letter and preserve the dash-connected words: . + - ; : ( ) , % ! ?

  2. Tabs are treated differently depending on what's around them:
x\ty => x\ty

but:

x\ty z => x + y + z
z x\ty => z + x + y

Number errors and inconsistencies

  1. An integer followed by a close paren and then any character other than a space or newline gets the paren included in its number token:
(10) => 10
(10), => 10)
(10)e => 10)  e

but:

(10.2) => 10.2
(10.2), => 10.2
  2. A period after a number is not stripped when it is followed by a space or a tab:
10.\n10.\t10. 10. => 10   10.   10.   10

but:

2.10.\n2.10.\t2.10. 2.10. => 2.10   2.10   2.10   2.10
  3. A period after a number is inconsistently stripped:
10. => 10
10._ => 10.
10., => 10.
10.. => 10.

Suggestion: Don't index whitespace and punctuation

These whitespace and zero-width characters get indexed, but probably shouldn't be:

  • no break space U+00A0
  • en space U+2002
  • em space U+2003
  • three-per-em space U+2004
  • four-per-em space U+2005
  • six-per-em space U+2006
  • figure space U+2007
  • punctuation space U+2008
  • thin space U+2009
  • hair space U+200A
  • zero width space U+200B
  • zero width non-joiner U+200C
  • zero width joiner U+200D
  • left-to-right mark U+200E
  • right-to-left mark U+200F
  • narrow no-break-space U+202F
  • medium mathematical space U+205F
  • zero width no-break space U+FEFF

These punctuation characters get indexed, too, but probably don't need to be:

  • §
  • «
  • ·
  • »
  • ‐ (hyphen U+2010, not -, which is "hyphen-minus")
  • – (en dash)
  • — (em dash)
  • ′ (prime)
  • ″ (double prime)
  • ( (fullwidth paren)
  • ) (fullwidth paren)
  • . (fullwidth period)
  • : (fullwidth colon)

Repeated Tokens Get Incorrect Offsets

Repeated tokens get incorrect offsets, especially in the presence of extra whitespace and whitespace-like characters. I found examples with spaces and with a right-to-left mark followed by a space. I left out the right-to-left mark example because it's invisible, but you can recreate the behavior by changing the first space to a right-to-left mark in #2 or #3 below. I haven't tested any other space-like characters.

  1. No extra spaces; everything is indexed as expected:
curl -s localhost:9200/wiki_content/_analyze?pretty -d '{"analyzer": "vi", "text" : "a b a b"}'

{
  "tokens" : [
    { "token" : "a", "start_offset" : 0, "end_offset" : 1, "type" : "<PHRASE>", "position" : 0 },
    { "token" : "b", "start_offset" : 2, "end_offset" : 3, "type" : "<PHRASE>", "position" : 1 },
    { "token" : "a", "start_offset" : 4, "end_offset" : 5, "type" : "<PHRASE>", "position" : 2 },
    { "token" : "b", "start_offset" : 6, "end_offset" : 7, "type" : "<PHRASE>", "position" : 3 }
  ]
}
  2. One extra leading space; both "b" tokens have offsets 3-4:
curl -s localhost:9200/wiki_content/_analyze?pretty -d '{"analyzer": "vi", "text" : " a b a b"}'

{
  "tokens" : [
    { "token" : "a", "start_offset" : 1, "end_offset" : 2, "type" : "<PHRASE>", "position" : 0 },
    { "token" : "b", "start_offset" : 3, "end_offset" : 4, "type" : "<PHRASE>", "position" : 1 },
    { "token" : "a", "start_offset" : 5, "end_offset" : 6, "type" : "<PHRASE>", "position" : 2 },
    { "token" : "b", "start_offset" : 3, "end_offset" : 4, "type" : "<PHRASE>", "position" : 3 }
  ]
}
  3. Two extra leading spaces; both "a" tokens have offsets 2-3, and both "b" tokens have offsets 4-5:
curl -s localhost:9200/wiki_content/_analyze?pretty -d '{"analyzer": "vi", "text" : "  a b a b"}'

{
  "tokens" : [
    { "token" : "a", "start_offset" : 2, "end_offset" : 3, "type" : "<PHRASE>", "position" : 0 },
    { "token" : "b", "start_offset" : 4, "end_offset" : 5, "type" : "<PHRASE>", "position" : 1 },
    { "token" : "a", "start_offset" : 2, "end_offset" : 3, "type" : "<PHRASE>", "position" : 2 },
    { "token" : "b", "start_offset" : 4, "end_offset" : 5, "type" : "<PHRASE>", "position" : 3 }
  ]
}
  4. A real-life example from Vietnamese Wikipedia, showing more long-distance duplicates. Sorry if the text doesn't make sense; I edited out a bit of Arabic script, which also contained an invisible right-to-left mark. I've added an asterisk (*) before the lines with incorrect offsets.
curl -s localhost:9200/wiki_content/_analyze?pretty -d '{"analyzer": "vi", "text" : "  ULY: Lop Nahiyisi, UPNY: Lop Nah̡iyisi ? ; giản thể: 洛浦县; bính âm: Luòpǔ xiàn, Hán  Abraxas friedrichi là một loài bướm đêm trong họ Geometridae. Dữ liệu liên quan tới Abraxas friedrichi tại Wikispecies"}'

{
  "tokens" : [
    { "token" : "uly",         "start_offset" : 2, "end_offset" : 5, "type" : "<PHRASE>", "position" : 0 },
    { "token" : "lop",         "start_offset" : 7, "end_offset" : 10, "type" : "<PHRASE>", "position" : 1 },
    { "token" : "nahiyisi",    "start_offset" : 11, "end_offset" : 19, "type" : "<PHRASE>", "position" : 2 },
    { "token" : "upny",        "start_offset" : 21, "end_offset" : 25, "type" : "<PHRASE>", "position" : 3 },
*   { "token" : "lop",         "start_offset" : 7, "end_offset" : 10, "type" : "<PHRASE>", "position" : 4 },
*   { "token" : "nah",         "start_offset" : 11, "end_offset" : 14, "type" : "<FOREIGN>", "position" : 5 },
    { "token" : "̡",           "start_offset" : 34, "end_offset" : 35, "type" : "<OTHER>", "position" : 6 },
*   { "token" : "iyisi",       "start_offset" : 14, "end_offset" : 19, "type" : "<PHRASE>", "position" : 7 },
    { "token" : "giản",        "start_offset" : 45, "end_offset" : 49, "type" : "<PHRASE>", "position" : 8 },
    { "token" : "thể",         "start_offset" : 50, "end_offset" : 53, "type" : "<PHRASE>", "position" : 9 },
    { "token" : "洛浦县",        "start_offset" : 55, "end_offset" : 58, "type" : "<FOREIGN>", "position" : 10 },
    { "token" : "bính",        "start_offset" : 60, "end_offset" : 64, "type" : "<PHRASE>", "position" : 11 },
    { "token" : "âm",          "start_offset" : 65, "end_offset" : 67, "type" : "<PHRASE>", "position" : 12 },
    { "token" : "luòpǔ",       "start_offset" : 69, "end_offset" : 74, "type" : "<PHRASE>", "position" : 13 },
    { "token" : "xiàn",        "start_offset" : 75, "end_offset" : 79, "type" : "<PHRASE>", "position" : 14 },
    { "token" : "hán",         "start_offset" : 81, "end_offset" : 84, "type" : "<PHRASE>", "position" : 15 },
    { "token" : "abraxas",     "start_offset" : 86, "end_offset" : 93, "type" : "<PHRASE>", "position" : 16 },
    { "token" : "friedrichi",  "start_offset" : 94, "end_offset" : 104, "type" : "<PHRASE>", "position" : 17 },
    { "token" : "một",         "start_offset" : 108, "end_offset" : 111, "type" : "<PHRASE>", "position" : 19 },
    { "token" : "loài",        "start_offset" : 112, "end_offset" : 116, "type" : "<PHRASE>", "position" : 20 },
    { "token" : "bướm",        "start_offset" : 117, "end_offset" : 121, "type" : "<PHRASE>", "position" : 21 },
    { "token" : "đêm",         "start_offset" : 122, "end_offset" : 125, "type" : "<PHRASE>", "position" : 22 },
    { "token" : "trong",       "start_offset" : 126, "end_offset" : 131, "type" : "<PHRASE>", "position" : 23 },
    { "token" : "họ",          "start_offset" : 132, "end_offset" : 134, "type" : "<PHRASE>", "position" : 24 },
    { "token" : "geometridae", "start_offset" : 135, "end_offset" : 146, "type" : "<PHRASE>", "position" : 25 },
    { "token" : "dữ liệu",     "start_offset" : 148, "end_offset" : 155, "type" : "<PHRASE>", "position" : 26 },
    { "token" : "liên quan",   "start_offset" : 156, "end_offset" : 165, "type" : "<PHRASE>", "position" : 27 },
    { "token" : "tới",         "start_offset" : 166, "end_offset" : 169, "type" : "<PHRASE>", "position" : 28 },
*   { "token" : "abraxas",     "start_offset" : 86, "end_offset" : 93, "type" : "<PHRASE>", "position" : 29 },
*   { "token" : "friedrichi",  "start_offset" : 94, "end_offset" : 104, "type" : "<PHRASE>", "position" : 30 },
    { "token" : "wikispecies", "start_offset" : 193, "end_offset" : 204, "type" : "<PHRASE>", "position" : 32 }
  ]
}

Plugin version for ES 2.4.4

Dear @duydo, could you help me create a plugin build for ES 2.4.4? It would allow me to complete my task. I use your plugin in conjunction with another one, and for now I cannot upgrade ES to version 5.
Thanks in advance
