Giter Site home page Giter Site logo

dsonet / elasticsearch-analysis-pinyin Goto Github PK

View Code? Open in Web Editor NEW

This project forked from infinilabs/analysis-pinyin

0.0 1.0 0.0 316 KB

The Pinyin Analysis plugin integrates JPinyin(https://github.com/stuxuhai/jpinyin) module into elasticsearch.

Java 100.00%

elasticsearch-analysis-pinyin's Introduction

Pinyin Analysis for ElasticSearch

The Pinyin Analysis plugin integrates Pinyin4j(http://pinyin4j.sourceforge.net/) module into elasticsearch.

Pinyin4j is a popular Java library supporting convertion between Chinese characters and most popular Pinyin systems. The output format of pinyin could be customized.

you can download this plugin from RTF project(https://github.com/medcl/elasticsearch-rtf)

--------------------------------------------------
| Pinyin4j   Analysis Plugin    | ElasticSearch  |
--------------------------------------------------
| master                        | 1.6.0 -> master|
--------------------------------------------------
| 1.3.0                         | 1.6.0          |
--------------------------------------------------
| 1.2.2                         | 1.0.0          |
--------------------------------------------------
| 1.2.0                         | 0.90.0         |
--------------------------------------------------
| 1.1.2                         | 0.20.2         |
--------------------------------------------------
| 1.1.1                         | 0.19.x         |
--------------------------------------------------
| 1.1.0                         | 0.19.0         |
--------------------------------------------------

The plugin includes a pinyin analyzer , two tokenizer: pinyin pinyin_first_letter and a token-filter: pinyin .

1.Create a index for doing some tests

curl -XPUT http://localhost:9200/medcl/ -d'
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "pinyin_analyzer" : {
                    "tokenizer" : "my_pinyin",
                    "filter" : "word_delimiter
]                }
            },
            "tokenizer" : {
                "my_pinyin" : {
                    "type" : "pinyin",
                    "first_letter" : "none",
                    "padding_char" : " "
                }
            }
        }
    }
}'

2.Analyzing a chinese name,such as 刘德华

http://localhost:9200/medcl/_analyze?text=%e5%88%98%e5%be%b7%e5%8d%8e&analyzer=pinyin_analyzer
{"tokens":[{"token":"liu de hua ","start_offset":0,"end_offset":3,"type":"word","position":1}]}

3.Thant's all,have fun.

optional config: the parameter first_letter can be set to: prefix , append , only and none ,default value is none

examples: first_letter set toprifix and padding_char is set to "" the analysis result will be:

{"tokens":[{"token":"ldhliudehua","start_offset":0,"end_offset":3,"type":"word","position":1}]}

and if we set first_letter to only ,the result will be:

{"tokens":[{"token":"ldh","start_offset":0,"end_offset":3,"type":"word","position":1}]}

also first_letter to append

{"tokens":[{"token":"liu de hua ldh","start_offset":0,"end_offset":3,"type":"word","position":1}]}

----------additional----------example-----------------------

if you wanna do a auto-complete with people's name,combining with the magic of pinyin,and it's very easy now,here is the detail instructions:

1.Index setting

curl -XPOST http://localhost:9200/medcl/_close
curl -XPUT http://localhost:9200/medcl/_settings -d'
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "pinyin_analyzer" : {
                    "tokenizer" : "my_pinyin",
                    "filter" : ["word_delimiter","nGram"]
                }
            },
            "tokenizer" : {
                "my_pinyin" : {
                    "type" : "pinyin",
                    "first_letter" : "prefix",
                    "padding_char" : " "
                }
            }
        }
    }
}'
curl -XPOST http://localhost:9200/medcl/_open

2.Create mapping

curl -XPOST http://localhost:9200/medcl/folks/_mapping -d'
{
    "folks": {
        "properties": {
            "name": {
                "type": "multi_field",
                "fields": {
                    "name": {
                        "type": "string",
                        "store": "no",
                        "term_vector": "with_positions_offsets",
                        "analyzer": "pinyin_analyzer",
                        "boost": 10
                    },
                    "primitive": {
                        "type": "string",
                        "store": "yes",
                        "analyzer": "keyword"
                    }
                }
            }
        }
    }
}'

3.Indexing

curl -XPOST http://localhost:9200/medcl/folks/andy -d'{"name":"刘德华"}'

4.Have a try

curl http://localhost:9200/medcl/folks/_search?q=name:%e5%88%98
curl http://localhost:9200/medcl/folks/_search?q=name:%e5%88%98%e5%be%b7
curl http://localhost:9200/medcl/folks/_search?q=name:liu
curl http://localhost:9200/medcl/folks/_search?q=name:ldh
curl http://localhost:9200/medcl/folks/_search?q=name:dehua

5.Use Pinyin-TokenFilter (contributed by @wangweiwei)

curl -XPUT http://localhost:9200/medcl1/ -d'
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "user_name_analyzer" : {
                    "tokenizer" : "whitespace",
                    "filter" : "pinyin_filter"
                }
            },
            "filter" : {
                "pinyin_filter" : {
                    "type" : "pinyin",
                    "first_letter" : "only",
                    "padding_char" : ""
                }
            }
        }
    }
}'

Token Test:刘德华 张学友 郭富城 黎明 四大天王

curl -XGET http://localhost:9200/medcl/_analyze?text=%e5%88%98%e5%be%b7%e5%8d%8e+%e5%bc%a0%e5%ad%a6%e5%8f%8b+%e9%83%ad%e5%af%8c%e5%9f%8e+%e9%bb%8e%e6%98%8e+%e5%9b%9b%e5%a4%a7%e5%a4%a9%e7%8e%8b&analyzer=user_name_analyzer
{"tokens":[{"token":"ldh","start_offset":0,"end_offset":3,"type":"word","position":1},{"token":"zxy","start_offset":4,"end_offset":7,"type":"word","position":2},{"token":"gfc","start_offset":8,"end_offset":11,"type":"word","position":3},{"token":"lm","start_offset":12,"end_offset":14,"type":"word","position":4},{"token":"sdtw","start_offset":15,"end_offset":19,"type":"word","position":5}]}

elasticsearch-analysis-pinyin's People

Contributors

doabit avatar dsonet avatar lrsec avatar medcl avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.