Hi! I've added recognition of numeral-quantifier words! The main code is as follows: package com.hankcs.hanlp.recognition.mq; …

Added recognition of numeral-quantifier words! (hanlp, CLOSED)

hankcs commented on May 18, 2024

Added recognition of numeral-quantifier words!

from hanlp.

Comments (8)

hankcs commented on May 18, 2024

Thanks for the support. The tokenizer now fully supports numerals and numeral-quantifier words!

yuchaozhou commented on May 18, 2024

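    // Requires: import com.hankcs.hanlp.tokenizer.StandardTokenizer;
    // Turn on numeral and numeral-quantifier recognition for the standard tokenizer.
    StandardTokenizer.SEGMENT.enableNumberQuantifierRecognize(true);
    String[] testCase = new String[]
    {
        "十九元套餐包括什么",        // "What does the nineteen-yuan plan include?"
        "九千九百九十九朵玫瑰",      // "nine thousand nine hundred ninety-nine roses"
        "壹佰块都不给我",            // "won't even give me one hundred kuai"
        "９０１２３４５６７８只蚂蚁", // "9012345678 ants" (full-width digits)
    };
    for (String sentence : testCase)
    {
        System.out.println(StandardTokenizer.segment(sentence));
    }

======================= Output ========================
[十/m, 九/b, 元/q, 套餐/n, 包括/v, 什么/ry]
[九/b, 千/m, 九/b, 千百/m, 九/b, 十/m, 九/b, 朵/q, 玫瑰/n]
[壹佰块/mq, 都/d, 不/d, 给/p, 我/rr]
[９０１２３４５６７８/m, 只/d, 蚂蚁/n]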

Note that the segmentation of "九千九百九十九朵玫瑰" contains the token "千百"???

hankcs commented on May 18, 2024

Also, because I modified data/dictionary/CoreNatureDictionary.txt, you need to delete the cache file data/dictionary/CoreNatureDictionary.txt.bin for the change to take effect.
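
A minimal sketch of that cleanup step, assuming the HanLP root (the directory that contains `data/`) is passed as the first argument or is the current working directory. The class name `DictionaryCacheCleaner` and its method are hypothetical helpers for illustration, not part of HanLP:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of HanLP: removes the stale binary cache
// so that the edited CoreNatureDictionary.txt is read again.
class DictionaryCacheCleaner {
    // Cache path quoted in the comment above, relative to the HanLP root (assumed layout).
    static final String CACHE = "data/dictionary/CoreNatureDictionary.txt.bin";

    /** Deletes the cache under the given root; returns true if a file was actually removed. */
    static boolean deleteCache(Path root) {
        try {
            return Files.deleteIfExists(root.resolve(CACHE));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(deleteCache(root) ? "cache deleted" : "no cache file found");
    }
}
```

Running it a second time simply reports that no cache file was found, so it is safe to call before every test run while the dictionary is being edited.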

hankcs commented on May 18, 2024

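data-for-1.1.5.zip still contains the old data. When the next version is released, the new cache will be packed into data.zip as well, so this problem will go away on its own.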

a198720 commented on May 18, 2024

Hi! In index segmentation, numeral-quantifier words do not seem to be split down to the finest granularity.

    IndexTokenizer.SEGMENT.enableNumberQuantifierRecognize(true);
    String[] testCase = new String[]
            {   
                    "中华人民共和国",
                    "十九元套餐包括什么",
                    "九千九百九十九朵玫瑰",
                    "壹佰块都不给我",
                    "９０１２３４５６７８只蚂蚁"
            };
    for (String sentence : testCase)
    {
        System.out.println(IndexTokenizer.segment(sentence));
    }
============== Segmentation result ===================

[中华人民共和国/ns, 中华/nz, 中华人民/nz, 华人/n, 人民/n, 共和/n, 共和国/n]
[十九元/mq, 套餐/n, 包括/v, 什么/r]
[九千九百九十九朵/mq, 玫瑰/n]
[壹佰块/m, 都/d, 不/d, 给/p, 我/r]
[９０１２３４５６７８只/mq, 蚂蚁/n]

hankcs commented on May 18, 2024

What exactly should finest-granularity splitting of numeral-quantifier words look like? Splitting them into single characters?

a198720 commented on May 18, 2024

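For example, 十九元 should be split into something like [十九元/mq, 十九/m, 元/q]. It is much the same as IK's modes: one is forward maximum matching, which corresponds to standard or smart segmentation in HanLP; the other is minimum matching, which emits every possible segmentation.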
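
To make the requested output concrete, here is a minimal sketch of that split, assuming a numeral-quantifier token is a run of numeral characters followed by a quantifier. The class `MqSplitSketch`, its `split` method, and the numeral character list are all hypothetical and only illustrate the idea; this is not how HanLP implements it:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: shows the finest-granularity output requested above.
class MqSplitSketch {
    // Assumed numeral characters: ASCII digits, full-width digits, common Chinese numerals.
    private static final String NUMERALS =
        "0123456789０１２３４５６７８９零〇一二两三四五六七八九十百千万亿壹贰叁肆伍陆柒捌玖拾佰仟";

    /** Returns the whole token (mq) followed by its numeral (m) and quantifier (q) parts. */
    static List<String> split(String token) {
        // Find where the leading run of numeral characters ends.
        int i = 0;
        while (i < token.length() && NUMERALS.indexOf(token.charAt(i)) >= 0) {
            i++;
        }
        List<String> terms = new ArrayList<>();
        if (i == 0 || i == token.length()) {
            // No numeral prefix or no quantifier suffix: return the token unchanged and untagged.
            terms.add(token);
            return terms;
        }
        terms.add(token + "/mq");
        terms.add(token.substring(0, i) + "/m");
        terms.add(token.substring(i) + "/q");
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(split("十九元"));           // [十九元/mq, 十九/m, 元/q]
        System.out.println(split("九千九百九十九朵")); // [九千九百九十九朵/mq, 九千九百九十九/m, 朵/q]
    }
}
```

The sketch keeps the whole token first and appends the finer parts after it, mirroring how the index output for 中华人民共和国 lists the full word followed by its sub-words.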

hankcs commented on May 18, 2024

It has been improved. Please try again.
