Comments (8)
Thanks for the support. The tokenizer now fully supports numerals and numeral-quantifier compounds!
from hanlp.
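import com.hankcs.hanlp.tokenizer.StandardTokenizer;

// Turn on numeral and numeral-quantifier recognition for the standard tokenizer.
StandardTokenizer.SEGMENT.enableNumberQuantifierRecognize(true);
String[] testCase = new String[]
{
    "十九元套餐包括什么",    // "What does the 19-yuan plan include?"
    "九千九百九十九朵玫瑰",  // "9,999 roses"
    "壹佰块都不给我",        // "Won't even give me a hundred yuan"
    "9012345678只蚂蚁",      // "9,012,345,678 ants"
};
// Print the segmentation of each test sentence.
for (String sentence : testCase)
{
    System.out.println(StandardTokenizer.segment(sentence));
}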
=======================Output========================
[十/m, 九/b, 元/q, 套餐/n, 包括/v, 什么/ry]
[九/b, 千/m, 九/b, 千百/m, 九/b, 十/m, 九/b, 朵/q, 玫瑰/n]
[壹佰块/mq, 都/d, 不/d, 给/p, 我/rr]
[9012345678/m, 只/d, 蚂蚁/n]
Note that the segmentation of "九千九百九十九朵玫瑰" contains "千百"???
from hanlp.
Also, because I modified data/dictionary/CoreNatureDictionary.txt, you need to delete the cache file data/dictionary/CoreNatureDictionary.txt.bin for the change to take effect.
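For anyone who prefers to clear the cache from code, here is a minimal sketch using only the Java standard library. It is not part of HanLP; the class name `ClearDictionaryCache` and the helper `deleteCache` are hypothetical, and the dictionary path is assumed to be relative to the working directory, so adjust it to wherever your data directory actually lives.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ClearDictionaryCache {
    /**
     * Deletes the ".bin" cache that sits next to the given dictionary file.
     * Returns true if a cache file existed and was removed, false otherwise.
     */
    static boolean deleteCache(Path dictionary) throws IOException {
        // The cache name is the dictionary file name with ".bin" appended.
        Path cache = dictionary.resolveSibling(dictionary.getFileName() + ".bin");
        return Files.deleteIfExists(cache);
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the data directory is under the current working directory.
        Path dictionary = Paths.get("data/dictionary/CoreNatureDictionary.txt");
        System.out.println(deleteCache(dictionary) ? "cache deleted" : "no cache found");
    }
}
```

Run it once after editing the dictionary, before the dictionary is next loaded; the dictionary text file itself is left untouched.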
from hanlp.
data-for-1.1.5.zip still contains the old data. When the next version is released, the new cache will be packed into data.zip as well, so this problem will go away.
from hanlp.
Hi! With the index tokenizer, it seems numeral-quantifier compounds are not split down to the finest granularity.
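import com.hankcs.hanlp.tokenizer.IndexTokenizer;

// Turn on numeral and numeral-quantifier recognition for the index tokenizer.
IndexTokenizer.SEGMENT.enableNumberQuantifierRecognize(true);
String[] testCase = new String[]
{
    "中华人民共和国",        // "People's Republic of China"
    "十九元套餐包括什么",    // "What does the 19-yuan plan include?"
    "九千九百九十九朵玫瑰",  // "9,999 roses"
    "壹佰块都不给我",        // "Won't even give me a hundred yuan"
    "9012345678只蚂蚁"       // "9,012,345,678 ants"
};
// Print the index-mode segmentation of each test sentence.
for (String sentence : testCase)
{
    System.out.println(IndexTokenizer.segment(sentence));
}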
==============Segmentation result===================
[中华人民共和国/ns, 中华/nz, 中华人民/nz, 华人/n, 人民/n, 共和/n, 共和国/n]
[十九元/mq, 套餐/n, 包括/v, 什么/r]
[九千九百九十九朵/mq, 玫瑰/n]
[壹佰块/m, 都/d, 不/d, 给/p, 我/r]
[9012345678只/mq, 蚂蚁/n]
from hanlp.
What exactly should finest-granularity splitting of numeral-quantifier compounds look like? Splitting into single characters?
from hanlp.
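For example, 十九元 should be split into [十九元/mq, 十九/m, 元/q]. It is much like the modes in ik: one is forward maximum matching, which corresponds to standard or smart segmentation in HanLP; the other is minimum matching, which outputs every possible segmentation.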
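The requested behavior can be sketched in plain Java as follows. This is only an illustration of the expected output, not how HanLP implements it: the class `NumberQuantifierSplitter`, the `split` method, and the `NUMERALS` character set are all hypothetical, and the numeral set is deliberately small rather than taken from any HanLP dictionary.

```java
import java.util.ArrayList;
import java.util.List;

public class NumberQuantifierSplitter {
    // Illustrative numeral characters only; not HanLP's actual dictionary.
    private static final String NUMERALS =
            "0123456789零一二三四五六七八九十百千万亿两壹贰叁肆伍陆柒捌玖拾佰仟";

    /**
     * Returns the whole compound tagged /mq, followed by its numeral part (/m)
     * and quantifier part (/q) when both parts are non-empty.
     */
    static List<String> split(String word) {
        List<String> terms = new ArrayList<>();
        terms.add(word + "/mq");
        // Find where the leading run of numeral characters ends.
        int end = 0;
        while (end < word.length() && NUMERALS.indexOf(word.charAt(end)) >= 0) {
            end++;
        }
        // Only emit sub-terms if there is both a numeral and a quantifier.
        if (end > 0 && end < word.length()) {
            terms.add(word.substring(0, end) + "/m");
            terms.add(word.substring(end) + "/q");
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(split("十九元"));   // prints [十九元/mq, 十九/m, 元/q]
        System.out.println(split("9012345678只"));
    }
}
```

Keeping the full compound as the first term and appending the finer pieces mirrors the way the index tokenizer output above lists 中华人民共和国 before its sub-words.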
from hanlp.
This has been improved. Please try again.
from hanlp.