Comments (8)
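@shi-yuan Thank you very much for the pointer. I went through all of the official filters and found that pattern_replace meets my needs.
The configuration is:
char_filter:
  price_patern:
    type: pattern_replace
    pattern: (\d{1,4})\.(\d{1,2})万
    replacement: $1$2
This turns 11.11万 into 1111.
I hope this helps anyone who needs it.
Thanks a lot!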
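To see what this pattern does outside Elasticsearch, here is a minimal Python sketch of the same substitution. The function name `strip_price_punctuation` is made up for illustration, and Python's `re` module only stands in for the Java regex engine that Elasticsearch uses, so treat it as an approximation of the char filter rather than its implementation.

```python
import re

# Same pattern as the char_filter in this comment. Python writes the
# group references as \1\2 where the Elasticsearch replacement uses $1$2.
PRICE_PATTERN = re.compile(r"(\d{1,4})\.(\d{1,2})万")


def strip_price_punctuation(text: str) -> str:
    """Drop the decimal point and the trailing 万 from prices like 11.11万."""
    return PRICE_PATTERN.sub(r"\1\2", text)


result = strip_price_punctuation("11.11万")
unchanged = strip_price_punctuation("11万")  # no decimal point, so no match
print(result, unchanged)
```

Note that the replacement only concatenates the digits; it is not a numeric conversion. A value without a decimal point, such as 11万, does not match the pattern and passes through unchanged.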
from elasticsearch-analysis-ansj.
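ansj configuration:
ansj:
  dic_path: "ansj/dic/user/" ## user dictionary location
  ambiguity_path: "ansj/dic/ambiguity.dic" ## ambiguity dictionary
  enable_name_recognition: true ## person-name recognition
  enable_num_recognition: true ## number recognition
  enable_quantifier_recognition: false ## quantifier recognition
  enabled_stop_filter: true ## whether to filter using the stop-word dictionary
  stop_path: "ansj/dic/stopLibrary.dic" ## stop-word dictionary
Contents of stopLibrary.dic: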
与
专业
Test with http://127.0.0.1:9200/_cat/test/analyze?text=我与小明的专业是计算机&analyzer=dic_ansj :
我 0 1 0 r
小明 2 4 1 nz
的 4 5 2 uj
是 7 8 3 v
计算机 8 11 4 n
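It works fine: the stop words 与 and 专业 no longer appear in the output.
Try reinstalling the plugin; it was updated a while ago.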
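One way to confirm which stop words were dropped is to look for gaps in the offsets. The sketch below assumes the five columns of the test output are token, start offset, end offset, position, and part-of-speech tag; that reading is inferred from the values, not taken from the plugin's documentation. The helper name `removed_spans` is hypothetical.

```python
# Input text and analyzer output copied from the test in this comment.
text = "我与小明的专业是计算机"
output = """\
我 0 1 0 r
小明 2 4 1 nz
的 4 5 2 uj
是 7 8 3 v
计算机 8 11 4 n"""


def removed_spans(text: str, output: str) -> list[str]:
    """Return the pieces of `text` that no emitted token covers."""
    covered_until = 0
    removed = []
    for line in output.splitlines():
        # Assumed column order: token, start, end, position, tag.
        _token, start, end, _position, _tag = line.split()
        start, end = int(start), int(end)
        if start > covered_until:
            removed.append(text[covered_until:start])
        covered_until = max(covered_until, end)
    if covered_until < len(text):
        removed.append(text[covered_until:])
    return removed


removed = removed_spans(text, output)
print(removed)
```

The gaps at offsets 1-2 and 5-7 correspond to the two entries in stopLibrary.dic.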
from elasticsearch-analysis-ansj.
OK, it works after updating to 2.3.3.1. Thanks.
from elasticsearch-analysis-ansj.
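One more note:
you do not have to add this configuration; the defaults are enough.
The stop-word dictionary applies to all three analyzers, not just dic_ansj.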
from elasticsearch-analysis-ansj.
I have a question about stop words. Say I have a price to tokenize: 11.11万
The result I want is 1111 or 11 11.
I tried setting 万 and . as stop words.
With index_ansj the result is:
{
"tokens": [
{
"token": "11.11万",
"start_offset": 0,
"end_offset": 6,
"type": "m",
"position": 0
},
{
"token": "11",
"start_offset": 0,
"end_offset": 2,
"type": "n",
"position": 1
},
{
"token": "11",
"start_offset": 3,
"end_offset": 5,
"type": "n",
"position": 2
}
]
}
With query_ansj the result is:
{
"tokens": [
{
"token": "11.11万",
"start_offset": 0,
"end_offset": 6,
"type": "m",
"position": 0
}
]
}
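Both results contain the token 11.11万, which includes my stop words.
So my question is: can stop words only filter tokens out of the tokenization result, without affecting how the text is split?
@shi-yuan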
from elasticsearch-analysis-ansj.
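Yes. The pipeline is: multiple character filters -> one tokenizer -> multiple token filters (including the stop filter).
If you set number recognition to false, 万 can be filtered out. With number recognition set to true, a 万 that sits next to a number is treated as part of the number, so 11.11万 is kept together as one token.
The dot cannot be removed this way; the tokenizer produces 11.11.
Take a look at the official Word Delimiter Token Filter: https://www.elastic.co/guide/en/elasticsearch/reference/2.3/analysis-word-delimiter-tokenfilter.html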
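The ordering described here can be illustrated with a toy pipeline. This is a sketch under stated assumptions, not the ansj or Elasticsearch implementation: `toy_tokenizer` is a made-up regex tokenizer that merely imitates number recognition by keeping a number and a trailing 万 together, and `price_char_filter` is a hypothetical character filter.

```python
import re
from typing import Callable, Iterable, List

CharFilter = Callable[[str], str]
TokenFilter = Callable[[List[str]], List[str]]


def toy_tokenizer(text: str) -> List[str]:
    # Toy stand-in for a tokenizer with number recognition enabled:
    # a number followed by 万 is emitted as a single token.
    return re.findall(r"\d+(?:\.\d+)?万?|\S", text)


def stop_filter(stop_words: Iterable[str]) -> TokenFilter:
    stop = set(stop_words)
    # Removes whole tokens only; it never looks inside a token.
    return lambda tokens: [t for t in tokens if t not in stop]


def analyze(text: str, char_filters: List[CharFilter],
            tokenizer: Callable[[str], List[str]],
            token_filters: List[TokenFilter]) -> List[str]:
    for char_filter in char_filters:    # 1. rewrite the raw text
        text = char_filter(text)
    tokens = tokenizer(text)            # 2. split the text into tokens
    for token_filter in token_filters:  # 3. drop or modify whole tokens
        tokens = token_filter(tokens)
    return tokens


def price_char_filter(text: str) -> str:
    # Hypothetical character filter that rewrites prices before tokenizing.
    return re.sub(r"(\d{1,4})\.(\d{1,2})万", r"\1\2", text)


stops = stop_filter(["万", "."])
without_char_filter = analyze("11.11万", [], toy_tokenizer, [stops])
with_char_filter = analyze("11.11万", [price_char_filter], toy_tokenizer, [stops])
print(without_char_filter, with_char_filter)
```

Because the stop filter runs after the tokenizer, it can only drop a token that is exactly 万 or a dot. It cannot split 11.11万 apart; only a character filter, which runs before the tokenizer, can change what the tokenizer sees.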
from elasticsearch-analysis-ansj.
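@shi-yuan
Thanks for the reply.
You wrote: "Set quantifier recognition to false and 万 can be filtered out."
I tried setting enable_quantifier_recognition to both true and false, and neither one filtered out 万. Is there a bug here?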
from elasticsearch-analysis-ansj.
@suclogger Sorry about that. Your case has nothing to do with quantifier recognition; it depends on number recognition.
from elasticsearch-analysis-ansj.