
Added recognition of number-quantifier words! about hanlp HOT 8 CLOSED

hankcs commented on May 18, 2024
Added recognition of number-quantifier words!

from hanlp.

Comments (8)

hankcs commented on May 18, 2024

Thanks for the support. The tokenizer now fully supports numerals and number-quantifier words!


yuchaozhou commented on May 18, 2024

    // Requires: import com.hankcs.hanlp.tokenizer.StandardTokenizer;
    // Turn on number-quantifier recognition for the standard tokenizer.
    StandardTokenizer.SEGMENT.enableNumberQuantifierRecognize(true);
    String[] testCase = new String[]
            {
                    "十九元套餐包括什么",
                    "九千九百九十九朵玫瑰",
                    "壹佰块都不给我",
                    "9012345678只蚂蚁",
            };
    // Segment each sentence and print the tagged tokens.
    for (String sentence : testCase)
    {
        System.out.println(StandardTokenizer.segment(sentence));
    }

======================= Output ========================
[十/m, 九/b, 元/q, 套餐/n, 包括/v, 什么/ry]
[九/b, 千/m, 九/b, 千百/m, 九/b, 十/m, 九/b, 朵/q, 玫瑰/n]
[壹佰块/mq, 都/d, 不/d, 给/p, 我/rr]
[9012345678/m, 只/d, 蚂蚁/n]

In the result for "九千九百九十九朵玫瑰", the token "千百" appears, even though the input never contains "千百"???


hankcs commented on May 18, 2024

Also, because I modified data/dictionary/CoreNatureDictionary.txt, you need to delete the cache file data/dictionary/CoreNatureDictionary.txt.bin for the change to take effect.
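The cache removal described above can be scripted. A minimal sketch, assuming the `data` directory sits under a root folder that you pass in; the class `CacheCleaner` and the method `deleteDictionaryCache` are hypothetical helpers for illustration and are not part of HanLP:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CacheCleaner {
    // Cache path relative to the HanLP root, as named in the comment above.
    public static final String CACHE = "data/dictionary/CoreNatureDictionary.txt.bin";

    /** Deletes the stale cache under root; returns true if a file was actually removed. */
    public static boolean deleteDictionaryCache(Path root) throws IOException {
        return Files.deleteIfExists(root.resolve(CACHE));
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the root defaults to the current working directory.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        boolean removed = deleteDictionaryCache(root);
        System.out.println(removed ? "Cache deleted." : "No cache file found.");
    }
}
```

Run it once after updating the dictionary; going by the comment above, the modified dictionary should then take effect the next time it is loaded.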


hankcs commented on May 18, 2024

data-for-1.1.5.zip still contains the old data. When the next version is released, the new cache will be packed into data.zip as well, so this problem will go away.


a198720 commented on May 18, 2024

Hi! With index segmentation, number-quantifier words do not seem to be split at the finest granularity.

    IndexTokenizer.SEGMENT.enableNumberQuantifierRecognize(true);
    String[] testCase = new String[]
            {   
                    "中华人民共和国",
                    "十九元套餐包括什么",
                    "九千九百九十九朵玫瑰",
                    "壹佰块都不给我",
                    "9012345678只蚂蚁"
            };
    for (String sentence : testCase)
    {
        System.out.println(IndexTokenizer.segment(sentence));
    }
============== Segmentation result ===================

[中华人民共和国/ns, 中华/nz, 中华人民/nz, 华人/n, 人民/n, 共和/n, 共和国/n]
[十九元/mq, 套餐/n, 包括/v, 什么/r]
[九千九百九十九朵/mq, 玫瑰/n]
[壹佰块/m, 都/d, 不/d, 给/p, 我/r]
[9012345678只/mq, 蚂蚁/n]


hankcs commented on May 18, 2024

What exactly should finest-granularity splitting of number-quantifier words look like? Splitting into single characters?


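a198720 commented on May 18, 2024

For example, 十九元 should be split into [十九元/mq, 十九/m, 元/q]. It is much like the modes in ik: one is forward maximum matching, which corresponds to standard or smart segmentation in HanLP; the other is minimum matching, which outputs every possible segmentation.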
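The expansion requested here can be sketched independently of HanLP. This is only an illustration, assuming a merged number-quantifier token ends in exactly one quantifier character; the class `QuantifierSplitter` and its `expand` method are hypothetical and do not reflect how HanLP implements index segmentation:

```java
import java.util.ArrayList;
import java.util.List;

public class QuantifierSplitter {
    /**
     * Expands a merged number-quantifier token into the merged form plus its parts.
     * Assumption: the last char is the quantifier and everything before it is the numeral.
     */
    public static List<String> expand(String token) {
        List<String> result = new ArrayList<>();
        result.add(token + "/mq");              // keep the coarse-grained token
        if (token.length() > 1) {
            int split = token.length() - 1;     // index of the quantifier character
            result.add(token.substring(0, split) + "/m");
            result.add(token.substring(split) + "/q");
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(expand("十九元")); // prints [十九元/mq, 十九/m, 元/q]
    }
}
```

A real implementation would need a quantifier dictionary to handle multi-character quantifiers, which this sketch deliberately ignores.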


hankcs commented on May 18, 2024

This has been improved. Please try again.
