Comments (8)

suclogger commented on June 14, 2024

@shi-yuan Thank you very much for the pointer. I went through all the official filters and found that pattern_replace meets my needs.

The configuration is as follows:

char_filter :
      price_patern :
        type: pattern_replace
        pattern: (\d{1,4})\.(\d{1,2})万
        replacement: $1$2

This turns 11.11万 into 1111.
I hope this helps anyone who needs it.
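
The pattern can be sanity-checked outside Elasticsearch. The following is a minimal sketch using Python's re module, not the plugin itself: the function name strip_price_unit is hypothetical, Python writes the replacement as \1\2 where the config above uses $1$2, and Elasticsearch's Java regex engine may behave differently in edge cases.

```python
import re

# Same pattern as the pattern_replace char filter above.
PRICE_PATTERN = re.compile(r"(\d{1,4})\.(\d{1,2})万")

def strip_price_unit(text):
    # Join the integer and fractional digits and drop the trailing 万.
    return PRICE_PATTERN.sub(r"\1\2", text)

print(strip_price_unit("11.11万"))  # 1111
print(strip_price_unit("11.11"))    # 11.11 (no 万, so the pattern does not match)
```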

Thanks a lot~

from elasticsearch-analysis-ansj.

shi-yuan commented on June 14, 2024

ansj configuration:

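ansj:
  dic_path: "ansj/dic/user/" ## user dictionary location
  ambiguity_path: "ansj/dic/ambiguity.dic" ## ambiguity dictionary
  enable_name_recognition: true ## person-name recognition
  enable_num_recognition: true ## number recognition
  enable_quantifier_recognition: false ## quantifier recognition
  enabled_stop_filter: true ## whether to filter using the stop-word dictionary
  stop_path: "ansj/dic/stopLibrary.dic" ## stop-word dictionary

Contents of stopLibrary.dic: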

与
专业

Test: http://127.0.0.1:9200/_cat/test/analyze?text=我与小明的专业是计算机&analyzer=dic_ansj

我         0       1       0       r       
小明      2       4       1       nz      
的         4       5       2       uj      
是         7       8       3       v       
计算机       8       11      4       n       

This works fine.
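
One way to confirm which words the stop filter removed is to look for gaps in the token offsets. The sketch below is only an illustration: the helper dropped_spans is hypothetical, and the token tuples are copied by hand from the output above.

```python
text = "我与小明的专业是计算机"

# (token, start_offset, end_offset), copied from the analyze output above.
tokens = [("我", 0, 1), ("小明", 2, 4), ("的", 4, 5), ("是", 7, 8), ("计算机", 8, 11)]

def dropped_spans(text, tokens):
    """Return the substrings of text that no token's offset range covers."""
    gaps, cursor = [], 0
    for _, start, end in sorted(tokens, key=lambda t: t[1]):
        if start > cursor:
            gaps.append(text[cursor:start])
        cursor = max(cursor, end)
    if cursor < len(text):
        gaps.append(text[cursor:])
    return gaps

print(dropped_spans(text, tokens))  # ['与', '专业']
```

The two uncovered spans are exactly the two entries in stopLibrary.dic.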

You can try installing it again; there has been an update since.

wpzdm commented on June 14, 2024

OK, updating to 2.3.3.1 fixed it, thx

wpzdm commented on June 14, 2024

One addition:
the configuration does not need to be added, the default configuration is enough;
the stop-word library applies to all three analyzers, not just dic_ansj

suclogger commented on June 14, 2024

I have a question about stop words. For example, I have a price that needs to be tokenized: 11.11万
The result I want is 1111 or 11 11
I tried setting . as a stop word.
The index_ansj result is:

{
  "tokens": [
    {
      "token": "11.11万",
      "start_offset": 0,
      "end_offset": 6,
      "type": "m",
      "position": 0
    },
    {
      "token": "11",
      "start_offset": 0,
      "end_offset": 2,
      "type": "n",
      "position": 1
    },
    {
      "token": "11",
      "start_offset": 3,
      "end_offset": 5,
      "type": "n",
      "position": 2
    }
  ]
}

The query_ansj result is:

{
  "tokens": [
    {
      "token": "11.11万",
      "start_offset": 0,
      "end_offset": 6,
      "type": "m",
      "position": 0
    }
  ]
}

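As you can see, both results contain 11.11万, and that token contains my stop word.
So my question is: can stop words only filter tokens out of the tokenization result, without affecting how the text is tokenized?
@shi-yuan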

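shi-yuan commented on June 14, 2024

Yes. The pipeline is: multiple character filters -> one tokenizer -> multiple token filters (including the stop filter)

If you set number recognition to false, 万 can be filtered out. With number recognition set to true, 万 next to a number is treated as part of the number, so 11.11万 stays together as one token.

The dot will not work: the tokenizer itself produces 11.11.

You can take a look at the official Word Delimiter Token Filter: https://www.elastic.co/guide/en/elasticsearch/reference/2.3/analysis-word-delimiter-tokenfilter.html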
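
The ordering of the three stages explains why a stop word cannot split a token. Here is a toy sketch of that pipeline, under loud assumptions: toy_tokenizer is a hypothetical regex stand-in, not ansj, and all function names are invented for illustration.

```python
import re

def char_filter(text):
    # Stage 1: rewrite the raw text before tokenization (like pattern_replace).
    return re.sub(r"(\d{1,4})\.(\d{1,2})万", r"\1\2", text)

def toy_tokenizer(text):
    # Stage 2: hypothetical stand-in for ansj that keeps a number and 万 together.
    return re.findall(r"\d+(?:\.\d+)?万?|\S", text)

def stop_filter(tokens, stop_words):
    # Stage 3: drops whole tokens equal to a stop word; it never splits a token.
    return [t for t in tokens if t not in stop_words]

def analyze(text, stop_words, use_char_filter=False):
    if use_char_filter:
        text = char_filter(text)
    return stop_filter(toy_tokenizer(text), stop_words)

print(analyze("11.11万", {"."}))                        # ['11.11万']
print(analyze("11.11万", {"."}, use_char_filter=True))  # ['1111']
```

Because the stop filter only sees finished tokens, "." as a stop word has no effect on 11.11万; changing the text has to happen in a character filter, before the tokenizer runs.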

suclogger commented on June 14, 2024

@shi-yuan
Thanks for the reply~

Set quantifier recognition to false and 万 can be filtered out

Whether I set enable_quantifier_recognition to true or false, 万 is not filtered out. Is there a bug here?

shi-yuan commented on June 14, 2024

@suclogger Sorry about that. Your case has nothing to do with quantifier recognition; it depends on number recognition.
