elasticsearch-analysis-voikko's Introduction

DEPRECATED

This plugin has been superseded by elasticsearch-analysis-raudikko, which is fully compatible with this plugin but needs no native libraries, special permissions, or extra configuration to install. It is also about twice as fast as this plugin.

Voikko Analysis for Elasticsearch

The Voikko Analysis plugin provides Finnish language analysis using Voikko.

Supported versions

Plugin version    Elasticsearch version
0.6.0             7.3.2
0.5.0             5.1.1
0.4.0             2.2.1
0.3.0             1.5.2

If you are not installing the latest version, follow the links in the table to see installation instructions for the old version.

Installing

Installing Voikko

The plugin needs the libvoikko shared library to work. The details of installing the library vary by operating system. On Debian-based systems, apt-get install libvoikko1 should work.

Next, you'll need to download the morpho dictionary (for libvoikko version 4.0+, use morpho dict v5 instead). Unzip it into Voikko's dictionary directory (e.g. /usr/lib/voikko in Debian) or into a directory you specify with the dictionaryPath configuration property.

Installing the plugin

Finally, to install the plugin, run:

bin/elasticsearch-plugin install https://github.com/EvidentSolutions/elasticsearch-analysis-voikko/releases/download/v0.6.0/elasticsearch-analysis-voikko-0.6.0.zip

Security policy

Elasticsearch ships with a fairly restrictive security policy. Plugins can specify the permissions they need in plugin-security.policy. However, elasticsearch-analysis-voikko uses the JNA library, which is already distributed with Elasticsearch and therefore can't be included in the plugin zip. This means that the security policy bundled with the plugin does not apply to JNA, yet JNA must be able to load libvoikko from the system.

Therefore you need to create a custom security policy, granting Elasticsearch itself the permission to load libvoikko:

grant {
  permission java.io.FilePermission "<<ALL FILES>>", "read";
  permission java.lang.reflect.ReflectPermission "newProxyInPackage.org.puimula.libvoikko";
};

(You don't really need to grant read access to <<ALL FILES>>; you can pass the location of libvoikko instead.)

Save this as custom-elasticsearch.policy and tell Elasticsearch to load it:

export ES_JAVA_OPTS=-Djava.security.policy=file:/path/to/custom-elasticsearch.policy
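
As noted above, the read permission does not have to cover <<ALL FILES>>. The following Python sketch renders a narrower policy file. The helper name render_policy and the example directories are hypothetical, and it is an assumption that directory-scoped grants are sufficient on your system (JNA or Voikko may need to read other locations as well), so treat the output as a starting point.

```python
def render_policy(read_dirs):
    """Render a policy that limits read access to the given directory trees.

    In Java policy syntax, a path ending in "/-" matches all files under
    that directory, recursively.
    """
    lines = ["grant {"]
    for directory in read_dirs:
        # Normalize so that "/usr/lib/voikko/" and "/usr/lib/voikko" render the same.
        directory = directory.rstrip("/")
        lines.append(f'  permission java.io.FilePermission "{directory}/-", "read";')
    lines.append(
        '  permission java.lang.reflect.ReflectPermission '
        '"newProxyInPackage.org.puimula.libvoikko";'
    )
    lines.append("};")
    return "\n".join(lines) + "\n"


# Hypothetical locations; replace with the directories that hold
# libvoikko and the dictionaries on your system.
print(render_policy(["/directory/of/libvoikko", "/usr/lib/voikko"]))
```

The ReflectPermission line is copied unchanged from the policy above; only the FilePermission entries are narrowed.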

Verify installation

After installing the plugin, you can quickly verify that it works by executing:

curl -XGET 'localhost:9200/_analyze' -H 'Content-Type: application/json' -d '
{
  "tokenizer" : "finnish",
  "filter" : [{"type": "voikko", "libraryPath": "/directory/of/libvoikko", "dictionaryPath": "/directory/of/voikko/dictionaries"}],
  "text" : "Testataan voikon analyysiä tällä tavalla yksinkertaisesti."
}'

If this runs without error messages, you can proceed to configure the plugin for your index.
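
The same check can be scripted. The Python sketch below builds the _analyze request with an explicit JSON Content-Type header, which Elasticsearch versions with strict content-type checking expect. The function names and the default base URL are assumptions made for illustration; the request body mirrors the one shown above.

```python
import json
import urllib.request


def build_analyze_request(text, library_path, dictionary_path,
                          base_url="http://localhost:9200"):
    """Build (but do not send) the _analyze request used for verification."""
    body = {
        "tokenizer": "finnish",
        "filter": [{
            "type": "voikko",
            "libraryPath": library_path,
            "dictionaryPath": dictionary_path,
        }],
        "text": text,
    }
    return urllib.request.Request(
        f"{base_url}/_analyze",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(text, library_path, dictionary_path):
    """Send the request and return the token strings from the response."""
    request = build_analyze_request(text, library_path, dictionary_path)
    with urllib.request.urlopen(request) as response:
        return [token["token"] for token in json.load(response)["tokens"]]
```

Calling analyze("Testataan voikon analyysiä tällä tavalla yksinkertaisesti.", "/directory/of/libvoikko", "/directory/of/voikko/dictionaries") against a running node should return a list of tokens; an HTTP error at this point usually means the plugin, the library path, or the security policy still needs attention.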

Configuring

Include the finnish tokenizer and the voikko filter in your analyzer, for example:

{
  "index": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "finnish",
          "filter": ["lowercase", "voikkoFilter"]
        }
      },
      "filter": {
        "voikkoFilter": {
          "type": "voikko"
        }
      }
    }
  }
}

You can use the following filter options to customize the behaviour of the filter:

Parameter          Default value     Description
language           fi_FI             Language to use
dictionaryPath     system dependent  Path to Voikko dictionaries
analyzeAll         false             Use all analysis possibilities or just the first
minimumWordSize    3                 Minimum length of words to analyze
maximumWordSize    100               Maximum length of words to analyze
libraryPath        system dependent  Path to the directory containing libvoikko
poolMaxSize        10                Maximum number of Voikko instances to pool
analysisCacheSize  1024              Number of analysis results to cache
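
As a minimal sketch of how these options fit into the settings shown earlier, the following Python helper merges chosen options into the voikkoFilter definition and rejects names that are not in the table. The helper name voikko_index_settings is hypothetical, and the option values in the example are arbitrary.

```python
import json

# Option names taken from the table above.
KNOWN_OPTIONS = {
    "language", "dictionaryPath", "analyzeAll", "minimumWordSize",
    "maximumWordSize", "libraryPath", "poolMaxSize", "analysisCacheSize",
}


def voikko_index_settings(**options):
    """Return index settings whose voikko filter carries the given options."""
    unknown = set(options) - KNOWN_OPTIONS
    if unknown:
        raise ValueError(f"unknown voikko filter options: {sorted(unknown)}")
    return {
        "index": {
            "analysis": {
                "analyzer": {
                    "default": {
                        "tokenizer": "finnish",
                        "filter": ["lowercase", "voikkoFilter"],
                    }
                },
                "filter": {
                    "voikkoFilter": {"type": "voikko", **options},
                },
            }
        }
    }


# Example values only; pick the ones that suit your index.
settings = voikko_index_settings(analyzeAll=True, minimumWordSize=2)
print(json.dumps(settings, indent=2))
```

The printed JSON can be used as the settings body when creating an index.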

Development

To run the tests, you need to set the voikko.home system property, which should point to a directory containing the libvoikko shared library and a subdirectory named dicts that contains the morpho dictionary.
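
Before running the tests, the expected layout can be checked with a short script. This Python sketch is not part of the project. It assumes that the shared library's file name starts with "libvoikko", which may not hold on every platform, and the function name check_voikko_home is hypothetical.

```python
from pathlib import Path


def check_voikko_home(home):
    """Return a list of problems found in a candidate voikko.home directory."""
    root = Path(home)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = []
    # Assumption: the shared library file name starts with "libvoikko".
    if not any(path.is_file() for path in root.glob("libvoikko*")):
        problems.append("no libvoikko shared library found")
    if not (root / "dicts").is_dir():
        problems.append("missing 'dicts' subdirectory")
    return problems
```

An empty list means the directory matches the layout described above; it does not prove that the library or dictionary versions are compatible.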

License

This library is released under the Apache License, Version 2.0.

elasticsearch-analysis-voikko's People

Contributors

hakuzumon, komu

elasticsearch-analysis-voikko's Issues

Verify installation benefits from adding -H option

The verify step requires a header option, e.g. -H 'Content-Type: application/json', to be added. Without it, one gets an error like:

{"error":"Content-Type header [application/x-www-form-urlencoded] is not supported","status":406}

Also, when running Elasticsearch as a service, it is useful to add the custom policy file to the jvm.options file of Elasticsearch 7.x, e.g. with the line:

-Djava.security.policy=file:/etc/elasticsearch/custom-elasticsearch.policy

Support expanding of compound words into separate tokens

I have patched our version of the plugin (based on v0.3.0) and added a configuration parameter expandCompounds to optionally support expanding of compound words (yhdyssanat) into separate tokens.

City-of-Helsinki@9a6bd81

I would like to get this feature into master and upstream, if you find it desirable. I can port it to master myself, but currently we are using 0.3.0.

We have found that extracting the parts of compound words is highly desirable in the index analysis stage, for several reasons:

  • users often misspell compound words and write them separately
  • often parts of compound words (for example "terveys" in "terveysasema") are meaningful and relevant even separated from the compound word
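
To make the requested behaviour concrete, here is a toy Python illustration of expanding compounds into extra tokens. It is not the code from the linked patch: the compound_parts lookup is a hypothetical stand-in for the morphological analysis that Voikko would provide, and the function name is made up for this example.

```python
def expand_compounds(tokens, compound_parts):
    """Emit each token, followed by its parts when it is a known compound.

    compound_parts is a hypothetical lookup (compound -> list of parts)
    standing in for real morphological analysis.
    """
    expanded = []
    for token in tokens:
        expanded.append(token)
        expanded.extend(compound_parts.get(token, []))
    return expanded


parts = {"terveysasema": ["terveys", "asema"]}
print(expand_compounds(["terveysasema", "helsinki"], parts))
# prints ['terveysasema', 'terveys', 'asema', 'helsinki']
```

With the parts indexed alongside the full word, a query for "terveys" or for the misspelled "terveys asema" can still match a document that contains "terveysasema".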

v0.4.0.zip release contents?

Hi, I tried installing the v0.4.0.zip release via:

sudo /usr/share/elasticsearch/bin/plugin install https://github.com/EvidentSolutions/elasticsearch-analysis-voikko/archive/v0.4.0.zip
-> Installing from https://github.com/EvidentSolutions/elasticsearch-analysis-voikko/archive/v0.4.0.zip...
Trying https://github.com/EvidentSolutions/elasticsearch-analysis-voikko/archive/v0.4.0.zip ...
Downloading ................................................DONE

but this happens:

Verifying https://github.com/EvidentSolutions/elasticsearch-analysis-voikko/archive/v0.4.0.zip checksums if available ...
NOTE: Unable to verify checksum for downloaded plugin (unable to find .sha1 or .md5 file to verify)
ERROR: Could not find plugin descriptor 'plugin-descriptor.properties' in plugin zip

(Elasticsearch version is 2.2.0, installed via rpm, and all prerequisites have been installed.)
Or can I create the plugin zip somehow? The zip seems to contain all the src files.
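
One way to tell a source archive from an installable plugin zip is to look for the descriptor named in the error message. The Python sketch below assumes that the installer expects plugin-descriptor.properties at the top level of the zip, as the error suggests for this Elasticsearch version; the function names are hypothetical.

```python
import zipfile

DESCRIPTOR = "plugin-descriptor.properties"


def descriptor_locations(zip_file):
    """Return every archive path whose file name is the plugin descriptor."""
    with zipfile.ZipFile(zip_file) as archive:
        return [
            name for name in archive.namelist()
            if name.rsplit("/", 1)[-1] == DESCRIPTOR
        ]


def has_top_level_descriptor(zip_file):
    """True if the descriptor sits at the root of the zip (assumed requirement)."""
    return DESCRIPTOR in descriptor_locations(zip_file)
```

If has_top_level_descriptor returns False and descriptor_locations only reports paths under a source directory (or nothing at all), the file is most likely a source archive rather than a built plugin.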
