Comments (8)
Cool! I've downloaded it and I'll give it a try next week.
from elasticsearch-analysis-vietnamese.
Oddly, I'm getting messages from @duydo in email, but not seeing them here. Not sure where/how to respond properly.
Hi @Trey314159,
I've just fixed the offset issue #25 and implemented your suggestions for the tokenizer. The tokenizer no longer splits non-Vietnamese characters, no longer joins capitalized words into a single token, and splits words and characters correctly.
For the text "sách .. sách ; 東京都":
{
  "tokens" : [
    { "token" : "sách", "start_offset" : 0, "end_offset" : 4, "type" : "<PHRASE>", "position" : 0 },
    { "token" : "sách", "start_offset" : 8, "end_offset" : 12, "type" : "<PHRASE>", "position" : 1 },
    { "token" : "東京都", "start_offset" : 15, "end_offset" : 18, "type" : "<FOREIGN>", "position" : 2 }
  ]
}
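One way to sanity-check offsets like these is to slice the input text with each token's start_offset and end_offset and compare the result with the token. The following is a minimal sketch in Python, not part of the plugin: check_offsets is a hypothetical helper name, and the tokens are copied by hand from the response above. It assumes the text contains only characters from the Basic Multilingual Plane, where Python string indices and the UTF-16 offsets that Lucene-based analyzers report should coincide, and it normalizes to NFC so that "á" counts as one character.

```python
import unicodedata

# Input text and tokens copied from the response above (type/position omitted).
text = unicodedata.normalize("NFC", "sách .. sách ; 東京都")

tokens = [
    {"token": "sách", "start_offset": 0, "end_offset": 4},
    {"token": "sách", "start_offset": 8, "end_offset": 12},
    {"token": "東京都", "start_offset": 15, "end_offset": 18},
]


def check_offsets(text, tokens):
    """Return the tokens whose [start_offset, end_offset) slice of `text`
    does not match the token, ignoring case (the tokenizer lowercases)."""
    mismatches = []
    for t in tokens:
        piece = text[t["start_offset"]:t["end_offset"]]
        expected = unicodedata.normalize("NFC", t["token"])
        if piece.lower() != expected.lower():
            mismatches.append(t)
    return mismatches


# An empty list means every offset pair points at its own token.
print(check_offsets(text, tokens))
```

The comparison is case-insensitive because the second example in this comment shows the tokenizer lowercasing its output (Año becomes año).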
For the text "año vs Año, r=22 vs R=22, x&y vs X&Y, x*y vs X*Y, x+y vs X+Y":
{
  "tokens" : [
    { "token" : "año", "start_offset" : 0, "end_offset" : 3, "type" : "<PHRASE>", "position" : 0 },
    { "token" : "vs", "start_offset" : 4, "end_offset" : 6, "type" : "<PHRASE>", "position" : 1 },
    { "token" : "año", "start_offset" : 7, "end_offset" : 10, "type" : "<PHRASE>", "position" : 2 },
    { "token" : "r", "start_offset" : 12, "end_offset" : 13, "type" : "<PHRASE>", "position" : 3 },
    { "token" : "22", "start_offset" : 14, "end_offset" : 16, "type" : "<NUMBER>", "position" : 4 },
    { "token" : "vs", "start_offset" : 17, "end_offset" : 19, "type" : "<PHRASE>", "position" : 5 },
    { "token" : "r", "start_offset" : 20, "end_offset" : 21, "type" : "<PHRASE>", "position" : 6 },
    { "token" : "22", "start_offset" : 22, "end_offset" : 24, "type" : "<NUMBER>", "position" : 7 },
    { "token" : "x", "start_offset" : 26, "end_offset" : 27, "type" : "<PHRASE>", "position" : 8 },
    { "token" : "y", "start_offset" : 28, "end_offset" : 29, "type" : "<PHRASE>", "position" : 9 },
    { "token" : "vs", "start_offset" : 30, "end_offset" : 32, "type" : "<PHRASE>", "position" : 10 },
    { "token" : "x", "start_offset" : 33, "end_offset" : 34, "type" : "<PHRASE>", "position" : 11 },
    { "token" : "y", "start_offset" : 35, "end_offset" : 36, "type" : "<PHRASE>", "position" : 12 },
    { "token" : "x", "start_offset" : 38, "end_offset" : 39, "type" : "<PHRASE>", "position" : 13 },
    { "token" : "y", "start_offset" : 40, "end_offset" : 41, "type" : "<PHRASE>", "position" : 14 },
    { "token" : "vs", "start_offset" : 42, "end_offset" : 44, "type" : "<PHRASE>", "position" : 15 },
    { "token" : "x", "start_offset" : 45, "end_offset" : 46, "type" : "<PHRASE>", "position" : 16 },
    { "token" : "y", "start_offset" : 47, "end_offset" : 48, "type" : "<PHRASE>", "position" : 17 },
    { "token" : "x", "start_offset" : 50, "end_offset" : 51, "type" : "<PHRASE>", "position" : 18 },
    { "token" : "y", "start_offset" : 52, "end_offset" : 53, "type" : "<PHRASE>", "position" : 19 },
    { "token" : "vs", "start_offset" : 54, "end_offset" : 56, "type" : "<PHRASE>", "position" : 20 },
    { "token" : "x", "start_offset" : 57, "end_offset" : 58, "type" : "<PHRASE>", "position" : 21 },
    { "token" : "y", "start_offset" : 59, "end_offset" : 60, "type" : "<PHRASE>", "position" : 22 }
  ]
}
I've attached a beta build for your ES v5.3.2 here: elasticsearch-analysis-vietnamese-5.3.2.zip. It would be great if you could test it again.
Thank you very much for your contribution.
@duydo, thanks for the update! I've downloaded the zip and I'll try to test it thoroughly next week and report back here.
@Trey314159 I just found a bug when running a query; please don't test that version until it's fixed.
@duydo, let me know when there's a new version.
Hi @Trey314159,
After investigating, I found that the original tokenizer is not thread-safe, which is the root cause of the NullPointerException you reported. I've just implemented a new thread-safe tokenizer; could you please give it a try? Here is the new version for ES v5.3.2: elasticsearch-analysis-vietnamese-5.3.2.zip.
Again, thank you very much for raising this critical issue :-)
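The thread does not say how the new tokenizer achieves thread safety. As a general illustration only, one common way to use a stateful, non-thread-safe component from several threads is to give each thread its own instance. The sketch below shows that pattern in Python rather than in the plugin's Java; StatefulTokenizer, get_tokenizer, tokenize and run_concurrently are all hypothetical names, and the toy tokenizer just splits on whitespace.

```python
import threading


class StatefulTokenizer:
    """Toy tokenizer (hypothetical) that keeps working state on the instance,
    so a single instance must not be shared between threads."""

    def __init__(self):
        self._buffer = []

    def tokenize(self, text):
        self._buffer.clear()
        for word in text.split():
            self._buffer.append(word.lower())
        return list(self._buffer)


# Each thread sees its own attributes on this object.
_local = threading.local()


def get_tokenizer():
    """Return the tokenizer owned by the calling thread, creating it on first use."""
    if not hasattr(_local, "tokenizer"):
        _local.tokenizer = StatefulTokenizer()
    return _local.tokenizer


def tokenize(text):
    return get_tokenizer().tokenize(text)


def run_concurrently(texts):
    """Tokenize each text in its own thread and return the results in input order."""
    results = [None] * len(texts)

    def worker(i):
        results[i] = tokenize(texts[i])

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(len(texts))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(run_concurrently(["Sách Hay", "X Y"]))
```

Because no two threads ever touch the same `_buffer`, one request cannot clear or overwrite another request's state midway through tokenization.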
I'm closing this issue; it will be fixed in the new tokenizer, #37.