ParsiAnalyzer
ParsiAnalyzer is an analysis plugin for Elasticsearch. Analysis is a process that consists of the following steps:
- Tokenizing a block of text into individual terms
- Normalizing these terms into a standard form
An analyzer is a wrapper that combines character filters, a tokenizer, and token filters. Elasticsearch ships with many built-in analyzers, but there is still room for improvement, especially for the Persian language. This plugin provides tools for tokenizing, normalizing, and stemming Persian text.
Key features
- Tokenize Persian text
  - Convert whitespace to a zero-width non-joiner (نیم‌فاصله) wherever necessary. For example, می رود to می‌رود.
  - Convert Persian punctuation to its English equivalent. For example, ۳/۱۴ to ۳.۱۴.
  - Tokenize Persian text by whitespace and punctuation.
- Normalize Persian tokens into a single canonical form
  - Transform all forms of Yeh, Kaf, Heh, and Hamza to a single form. For example, براي to برای.
  - Convert all Persian and Arabic digits to their English equivalents. For example, ۱۴۳ to 143.
  - Remove diacritics (اِعراب) from words. For example, اَرّه to اره.
  - Remove Kashida from words. For example, بادبــــــادک to بادبادک.
- Remove common Persian stop words
  - Persian stop words such as از and به are removed.
- Stem Persian words
  - Remove common Persian suffixes, for example ها or ان.
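To make the normalization, stop-word, and stemming steps above more concrete, here is a rough Python sketch of the same ideas. It is not the plugin's implementation: the character tables, the two-entry stop-word set, the two suffixes, and the function names (normalize, stem, analyze) are assumptions chosen only to mirror the examples in the list, and the Heh and Hamza mappings and the zero-width non-joiner handling are left out.

```python
import re

# Assumed mappings for illustration only; the plugin's real tables may differ.
CHAR_MAP = {
    "\u064A": "\u06CC",  # Arabic Yeh -> Persian Yeh
    "\u0649": "\u06CC",  # Alef Maksura -> Persian Yeh
    "\u0643": "\u06A9",  # Arabic Kaf -> Persian Kaf (Keheh)
}
# Persian (U+06F0..U+06F9) and Arabic-Indic (U+0660..U+0669) digits -> ASCII digits
for i in range(10):
    CHAR_MAP[chr(0x06F0 + i)] = str(i)
    CHAR_MAP[chr(0x0660 + i)] = str(i)

TRANSLATION = str.maketrans(CHAR_MAP)
DIACRITICS = re.compile("[\u064B-\u0652]")  # Fathatan through Sukun
KASHIDA = "\u0640"

# Only the two stop words and two suffixes named in the list above.
STOP_WORDS = {"\u0627\u0632", "\u0628\u0647"}  # از , به
SUFFIXES = ("\u0647\u0627", "\u0627\u0646")    # ها , ان


def normalize(token):
    """Strip diacritics and Kashida, then unify letter and digit forms."""
    token = DIACRITICS.sub("", token)
    token = token.replace(KASHIDA, "")
    return token.translate(TRANSLATION)


def stem(token):
    """Naively strip one known suffix, keeping at least two characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 1:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Whitespace-tokenize, normalize, drop stop words, then stem."""
    tokens = [normalize(t) for t in text.split()]
    return [stem(t) for t in tokens if t not in STOP_WORDS]
```

Suffix stripping this naive will over-stem ordinary words that merely end in one of the suffixes, which is one reason a real stemmer needs more rules than this sketch has.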
Installation
To install the plugin for Elasticsearch 7.8.0, run this command:
bin\elasticsearch-plugin install https://www.dropbox.com/s/k84y72gavy1x8a6/ParsiAnalyzer-7.8.0.zip?dl=1
Usage
To see how this plugin works, you can use Elasticsearch's analyze API:
POST _analyze
{
  "analyzer": "parsi",
  "text": "روباه قهوهاي چابك از روی سگ تنبل می پرد"
}
This will give you these tokens: [روباه,قهوهای,چابک,روی,سگ,تنبل,میپرد]
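The same request can be sent from a script. The sketch below uses only the Python standard library and assumes an Elasticsearch node with the plugin installed is reachable at http://localhost:9200; that address and the helper names (build_analyze_request, extract_tokens, analyze) are placeholders for illustration, not part of the plugin.

```python
import json
import urllib.request


def build_analyze_request(text, analyzer="parsi", base_url="http://localhost:9200"):
    """Build a POST request for the _analyze API (base_url is an assumed address)."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_tokens(response_body):
    """Pull the token strings out of an _analyze JSON response."""
    return [entry["token"] for entry in json.loads(response_body)["tokens"]]


def analyze(text, **kwargs):
    """Send the request and return the token list; needs a running node."""
    with urllib.request.urlopen(build_analyze_request(text, **kwargs)) as resp:
        return extract_tokens(resp.read())
```

Calling analyze with the sample sentence above should return the token list shown, provided the node is running and the plugin is installed.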
ParsiAnalyzer can be specified directly in the field mapping as follows:
PUT /my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "parsi"
      }
    }
  }
}
Contact me
Email: n.esmaielyfard [at] gmail.com