Comments (4)
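1. Adding the new word through CustomDictionary works: the word is split out correctly in the coarse segmentation stage.
2. The problem lies in how the n-gram frequencies are handled.
Delete the n-gram cache, then add "绝对@高大上" to CoreNatureDictionary.ngram.txt.
There is a further problem, though. When the n-grams are loaded, the CoreBiGramTableDictionary class checks whether the preceding and following words of each transition are in the core dictionary; the entry is loaded if they are and skipped if they are not. So relying on CustomDictionary alone fails here: the newly added word is not in the core dictionary, and its n-gram transition is skipped.
3. This limitation probably still needs to be addressed in later development.
4. For now, adding "高大上" directly to the core dictionary and "绝对@高大上" to CoreNatureDictionary.ngram.txt solves the problem: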
[外观/n, 绝对/d, 高大上/a, ,/w, 不信/v, 的/ude1, 是/vshi, 没/d, 见过/v, ./w]
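The load-time check described in point 2 can be illustrated with a small, self-contained sketch. This is a simplified model, not HanLP's actual code: the class `BigramLoadSketch`, the method `loadBigrams`, the whitespace-separated "A@B frequency" line format, and the frequency value 10 are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Simplified model of a bigram loader that skips transitions containing unknown words. */
public class BigramLoadSketch {

    /**
     * Returns only the "A@B frequency" lines whose words A and B are both
     * present in coreWords. Hypothetical helper, not part of HanLP.
     */
    public static List<String> loadBigrams(List<String> lines, Set<String> coreWords) {
        List<String> loaded = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length != 2) {
                continue; // malformed line: expected "A@B frequency"
            }
            String[] pair = fields[0].split("@", 2);
            if (pair.length != 2) {
                continue; // no "@" separator between the two words
            }
            // The transition survives only if both words are core-dictionary entries.
            if (coreWords.contains(pair[0]) && coreWords.contains(pair[1])) {
                loaded.add(line);
            }
        }
        return loaded;
    }

    public static void main(String[] args) {
        List<String> ngramLines = Arrays.asList("绝对@高大上 10"); // frequency is made up
        Set<String> core = new HashSet<>(Arrays.asList("绝对"));

        // "高大上" is not a core word yet, so the transition is skipped.
        System.out.println(loadBigrams(ngramLines, core)); // prints []

        // Once the word is in the core dictionary, the transition is kept.
        core.add("高大上");
        System.out.println(loadBigrams(ngramLines, core)); // prints [绝对@高大上 10]
    }
}
```

Under this model, a word that exists only in a custom dictionary never gets its transition loaded, which matches the behavior reported above.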
from hanlp.
Adding a CoreNatureDictionary.add method would be enough to fix this.
from hanlp.
Suggestion: add a CoreNatureDictionary.add method, and let CoreDictionaryPath accept several appended dictionaries the way CustomDictionaryPath does.
from hanlp.
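The problem is indeed, as @yuchaozhou said, the missing n-gram transition. So you need to:
- add the word to CoreNatureDictionary.txt
- add the transition to CoreNatureDictionary.ngram.txt
Files with "Core" in their name are trained models, and I would rather users did not add or remove entries in them. Dynamic insertion and deletion in a double-array trie is also very expensive. So I do not plan to implement CoreNatureDictionary.add.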
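
The two manual steps, plus the cache deletion mentioned earlier in the thread, could be scripted roughly as follows. This is a sketch under stated assumptions: the class `CoreDictionaryPatchSketch` and its methods are hypothetical, the tab-separated line layouts ("word, nature, frequency" and "A@B, frequency") and the frequency values are assumed, and the cache file names `CoreNatureDictionary.txt.bin` and `CoreNatureDictionary.ngram.txt.table.bin` are a guess that should be checked against your own data directory.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Appends a word and one transition to the core dictionary files, then drops stale caches. */
public class CoreDictionaryPatchSketch {

    /**
     * Hypothetical helper, not part of HanLP.
     * natureAndFreq is the text after the word, for example "a\t10".
     */
    public static void patch(Path dictDir, String word, String natureAndFreq,
                             String previousWord, int transitionFreq) throws IOException {
        // Step 1: add the word to the core dictionary.
        appendLine(dictDir.resolve("CoreNatureDictionary.txt"), word + "\t" + natureAndFreq);
        // Step 2: add the transition "previousWord@word" to the n-gram file.
        appendLine(dictDir.resolve("CoreNatureDictionary.ngram.txt"),
                previousWord + "@" + word + "\t" + transitionFreq);
        // Step 3: delete binary caches so the text files are re-read (file names are assumed).
        Files.deleteIfExists(dictDir.resolve("CoreNatureDictionary.txt.bin"));
        Files.deleteIfExists(dictDir.resolve("CoreNatureDictionary.ngram.txt.table.bin"));
    }

    /** Appends one line, first adding a newline if the file does not already end with one. */
    private static void appendLine(Path file, String line) throws IOException {
        String prefix = "";
        if (Files.exists(file) && Files.size(file) > 0) {
            // Reading the whole file is simple but wasteful for large dictionaries.
            byte[] bytes = Files.readAllBytes(file);
            if (bytes[bytes.length - 1] != '\n') {
                prefix = "\n";
            }
        }
        Files.write(file, (prefix + line + "\n").getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        // Works on a temporary directory so a real installation is never touched.
        Path dir = Files.createTempDirectory("hanlp-dict-sketch");
        patch(dir, "高大上", "a\t10", "绝对", 10); // frequencies are made up
        System.out.println(Files.readAllLines(dir.resolve("CoreNatureDictionary.txt")));
        System.out.println(Files.readAllLines(dir.resolve("CoreNatureDictionary.ngram.txt")));
    }
}
```

Back up the original dictionary files before running anything like this against a real data directory, since these are the trained model files the maintainer would rather users leave unmodified.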
from hanlp.