huaban / jieba-analysis
jieba Chinese word segmentation (Java version)
Home Page: https://github.com/huaban/jieba-analysis
License: Apache License 2.0
JiebaSegmenter segmenter = new JiebaSegmenter();
Is this class thread-safe?
WordDictionary dictAdd = WordDictionary.getInstance();
File file = new File("D:/jieba-analysis/conf/user.dict");
dictAdd.loadUserDict(file);
With the default setup (no user.dict loaded), "鲜芋仙" in my text is split into "鲜芋" and "仙". I added the line "鲜芋仙 3" to user.dict and then loaded it in the program. The console shows that the custom dictionary was loaded, but the segmentation result did not change. Is there something wrong with how my custom dictionary is written? Thanks.
When loadUserDict reads the dictionary file, the system's default encoding can cause the contents to be decoded with the wrong charset.
This is the code I added; it adds a charset parameter to loadUserDict.
https://github.com/sephXD/jieba-analysis/commit/96187592d3e511b54f0dbbcc3b7d448f73910a11
I am not very familiar with Git, so please bear with me.
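A minimal sketch of loading a user dictionary with an explicit encoding, assuming a loadUserDict overload that accepts a Charset (as added in the commit above); the path is a placeholder:

import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import com.huaban.analysis.jieba.WordDictionary;

public class UserDictUtf8 {
    public static void main(String[] args) {
        WordDictionary dict = WordDictionary.getInstance();
        // Decode the dictionary as UTF-8 regardless of the platform default encoding.
        dict.loadUserDict(Paths.get("conf/user.dict"), StandardCharsets.UTF_8);
    }
}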
JiebaSegmenter segmenter = new JiebaSegmenter();
and then the segmenter.process method is called from multiple threads. Could this cause any problems? (As the title asks.)
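A minimal sketch of the sharing pattern being asked about, with one segmenter instance reused across worker threads (class name and pool sizes are hypothetical):

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import com.huaban.analysis.jieba.JiebaSegmenter;
import com.huaban.analysis.jieba.JiebaSegmenter.SegMode;
import com.huaban.analysis.jieba.SegToken;

public class SharedSegmenterDemo {
    // One shared instance; the dictionary behind it is a process-wide singleton.
    private static final JiebaSegmenter SEGMENTER = new JiebaSegmenter();

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 8; i++) {
            pool.submit(() -> {
                List<SegToken> tokens = SEGMENTER.process("这是一个伸手不见五指的黑夜。", SegMode.INDEX);
                System.out.println(tokens);
            });
        }
        pool.shutdown();
    }
}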
Is there a version that supports JDK versions below 1.8?
I get an error when building:
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-gpg-plugin:1.4:sign (sign-artifacts) on project jieba-analysis: Exit code: 2 -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
I tried loadUserDict and it did not work. Could you publish usage instructions?
Calling sentenceProcess on JiebaSegmenter to segment "鲜芋仙 3" throws a NullPointerException.
Thanks to huaban for this Java version; it works well! But how can dates be supported, e.g. 2014-08-28?
Hello,
1. I am configuring the jieba tokenizer in a Solr engine. Is there a ready-to-use Jar package?
2. Also, by default the jieba tokenizer does not further segment strings that mix English letters and digits. I would like to split them; how should I modify the source code?
For example, with SourceString = "C49D47618", after segmentation I would like to get: "C", "49", "D", "47618", "C49D47618".
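A minimal post-processing sketch of that splitting, done outside the segmenter with a plain regex (class name is hypothetical):

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlnumSplitter {
    private static final Pattern RUNS = Pattern.compile("[A-Za-z]+|[0-9]+");

    // Splits "C49D47618" into ["C", "49", "D", "47618", "C49D47618"].
    public static List<String> split(String token) {
        List<String> parts = new ArrayList<>();
        Matcher m = RUNS.matcher(token);
        while (m.find()) {
            parts.add(m.group());
        }
        parts.add(token); // keep the original token as well
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("C49D47618"));
    }
}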
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
cannot be resolved.
Hi, when I use sentenceProcess, I get:
Exception in thread "main" java.lang.NullPointerException
at com.huaban.analysis.jieba.JiebaSegmenter.createDAG(JiebaSegmenter.java:28)
at com.huaban.analysis.jieba.JiebaSegmenter.sentenceProcess(JiebaSegmenter.java:158)
JiebaSegmenter segmenter = new JiebaSegmenter();
System.out.println(segmenter.sentenceProcess("黑 X").toString());
But when using process:
System.out.println(segmenter.process("黑 X", SegMode.INDEX).toString());
this exception does not occur.
Thanks.
I want to load this project in Eclipse, but only the folder is loaded, not a project, and it cannot be run.
Are the .classpath and .project files missing?
For example, could it directly return a list of segmented words, or a sentence joined with spaces? Or is there any text-processing helper for this?
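A minimal sketch of producing both forms on top of the existing API, assuming SegToken exposes the matched text in its word field:

import java.util.List;
import java.util.stream.Collectors;
import com.huaban.analysis.jieba.JiebaSegmenter;
import com.huaban.analysis.jieba.JiebaSegmenter.SegMode;
import com.huaban.analysis.jieba.SegToken;

public class SimpleOutputs {
    public static void main(String[] args) {
        JiebaSegmenter segmenter = new JiebaSegmenter();
        List<SegToken> tokens = segmenter.process("这是一个伸手不见五指的黑夜。", SegMode.SEARCH);

        // A plain list of the segmented words.
        List<String> words = tokens.stream().map(t -> t.word).collect(Collectors.toList());
        System.out.println(words);

        // The same sentence joined with spaces.
        System.out.println(String.join(" ", words));
    }
}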
hi all,
I know that jieba-analysis supports adding a custom dict by using:
WordDictionary.loadUserDict(dataDir);
Could a method be added that lets the user reset the default dict?
thanks
Hello, is there a ready-to-use Jar package? Thank you very much!
Regarding the trie implementation: is there any better documentation to look at?
I get "cannot resolve symbol 'junit'", and reimporting in Maven does not help. How can this be fixed? The IDE is IntelliJ.
Hello, I want to use jieba in an Android project, but loading the model is very slow, usually 15-20 seconds. Is there any way to make the model load faster?
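One minimal mitigation sketch, assuming the dictionary singleton is loaded the first time it is touched: warm it up on a background thread at app start so the first real segmentation call does not block on loading.

public class DictWarmup {
    public static void main(String[] args) {
        // Hypothetical warm-up: first touch of the singleton triggers dictionary loading.
        new Thread(() -> {
            com.huaban.analysis.jieba.WordDictionary.getInstance();
        }).start();
    }
}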
The Android SDK does not include the classes under nio.file. Where can they be downloaded?
I ran the part-of-speech tagging code, but the result only contains each word's start index and length; there is no POS tag.
import os
import jieba

homedir = os.getcwd()
jieba.load_userdict(os.path.join(homedir, "words.txt"))
The following error is reported at runtime:
File "extract_tags.py", line 14, in <module>
jieba.load_userdict(os.path.join(homedir, "words.txt"))
File "/Library/Python/2.7/site-packages/jieba/__init__.py", line 381, in load_userdict
f.name, lineno, line))
ValueError
Are these features available?
According to the README, the POS feature seems to have been removed for performance reasons. Which of the Java releases include part-of-speech tagging, or has that feature never been added?
Hello, why does the program have to convert uppercase letters to lowercase first to improve recognition accuracy? I commented out the line char ch = CharacterUtil.regularize(paragraph.charAt(i)); and found that once it is disabled, segmentation accuracy drops, and I do not know why. For example, with "C++": with the line enabled, "c++" is recognized directly; once it is disabled, the result becomes "C", "++". Clearly the correct result should be "C++".
Automatically lowercasing uppercase letters causes problems in the processing that happens after segmentation.
As stated in the title.
loadDict and loadUserDict handle freqs.put(word, Math.log(freq / total)); differently: when loading the user dictionary, the total value is not recalculated.
I added a custom dictionary as follows:
UTF-8 3 nz
utf-8 3 nz
The log shows that it loaded successfully:
user dict C:\Users\pc\Desktop\eclipse\workspace\neo4jlearn\conf\user.dict load finished, tot words:2, time elapsed:1ms
However, "UTF-8" is still split into three tokens: "utf", "-" and "8".
I checked, and cases like "c++" do work.
Also, is English converted to lowercase?
Exception in thread "main" java.lang.NumberFormatException: For input string: "server"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1241)
at java.lang.Double.valueOf(Double.java:504)
at com.huaban.analysis.jieba.WordDictionary.loadUserDict(WordDictionary.java:151)
at com.huaban.analysis.jieba.WordDictionary.loadUserDict(WordDictionary.java:134)
The entry that triggers the error is "sql server".
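A hedged illustration of what the stack trace suggests: if a user-dictionary line is split on whitespace into word / frequency / tag columns, a word that itself contains a space, such as "sql server", pushes "server" into the frequency column (the column layout here is an assumption, not confirmed from the source):

public class UserDictLineDemo {
    public static void main(String[] args) {
        String line = "sql server 3";            // the offending entry
        String[] tokens = line.split("\\s+");    // -> ["sql", "server", "3"]
        // Treating the second column as the frequency then fails:
        double freq = Double.valueOf(tokens[1]); // NumberFormatException: For input string: "server"
        System.out.println(freq);
    }
}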
Hi, I am reading your source code and have a question:
private Map<Integer, Pair<Integer>> calc(String sentence, Map<Integer, List<Integer>> dag)
Inside this function there is:
double freq = wordDict.getFreq(sentence.substring(i, x + 1)) + route.get(x + 1).freq;
Why is it written this way? The frequencies of two adjacent words are added, and the addition goes from back to front, so the later frequencies get carried over to the first word. Is there a theoretical basis for this? Please explain.
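A hedged reading of that line: the stored "frequencies" are log probabilities (Math.log(freq / total)), so adding them corresponds to multiplying word probabilities along a segmentation path, and calc fills the route table from the end of the sentence backwards as a dynamic program over the DAG:

route(i) = max over x in DAG(i) of ( log P(sentence[i..x]) + route(x + 1) ),   with route(sentence.length()) = 0

The best score at position i reuses the already-computed best score of the suffix starting at x + 1, which is why the later values appear to be "propagated" toward the first word: for every prefix position the recursion simply picks the segmentation of the remaining suffix with the highest total log probability.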
For the unreleased POS-tagging version, roughly how fast does it run? Is the POS tagging implemented by following the Python version?
import jieba
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mengbin/jython2.7.0/Lib/site-packages/jieba/__init__.py", line 15, in <module>
    from ._compat import *
ImportError: No module named _compat
After segmentation, how can I get the part of speech of each word?
Why did the rolled-back version drop POS tagging?
Will more features continue to be developed, such as the POS tagging found in the Python version?
I looked at the code: only entries that have all of word, word frequency, and POS tag are loaded, but jieba itself does not impose that restriction.
How was this model file trained? What training corpus was used, and is the corresponding training code available? I would like to update the model file, because I use the segmenter in the e-commerce domain.
@piaolingxue Hello: I downloaded your jieba-analysis and analyzer-solr (https://github.com/sing1ee/analyzer-solr.git) and used them with Solr 5.5.2. The configuration is as follows:
<fieldType name="text_jieba" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory" segMode="SEARCH"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory" segMode="SEARCH"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>
The error message is as follows:
2016/8/11 下午3:10:33 ERROR null HttpSolrCall null:java.lang.RuntimeException: java.lang.NoSuchFieldError: word
null:java.lang.RuntimeException: java.lang.NoSuchFieldError: word
at org.apache.solr.servlet.HttpSolrCall.sendError(HttpSolrCall.java:607)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:475)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:257)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:208)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:192)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:165)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:198)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:108)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:140)
.................
Caused by: java.lang.NoSuchFieldError: word
at analyzer.solr5.jieba.JiebaTokenizer.incrementToken(JiebaTokenizer.java:38)
at org.apache.solr.handler.AnalysisRequestHandlerBase.analyzeTokenStream(AnalysisRequestHandlerBase.java:188)
at org.apache.solr.handler.AnalysisRequestHandlerBase.analyzeValue(AnalysisRequestHandlerBase.java:127)
at org.apache.solr.handler.FieldAnalysisRequestHandler.analyzeValues(FieldAnalysisRequestHandler.java:220)
at org.apache.solr.handler.FieldAnalysisRequestHandler.handleAnalysisRequest(FieldAnalysisRequestHandler.java:181)
.............
Where is my configuration wrong, and how can I fix it? Thank you!
Hi,
I'm using the Jieba analyser to index Chinese characters in Solr. Segmentation works fine when using the Analysis page on the Solr Admin UI.
However, when I try highlighting in Solr, it does not highlight in the correct place. For example, when I search for 自然环境与企业本身, it highlights 认<em>为自然环</em><em>境</em><em>与企</em><em>业本</em>身的.
Even when I search for the English word responsibility, it highlights <em> responsibilit</em>y.
Basically the highlighting goes off by 1 character/space consistently.
Anyone knows what could be the issue?
I'm using jieba-analysis-1.0.0, Solr 5.2.1 and Lucene 5.1.0
Regards,
Edwin
Before loading my own dictionary with WordDictionary.getInstance().loadUserDict, five occurrences of "解除合同" could be segmented out of the text; after loading my dictionary, only two are found. Is there anything particular about how dictionaries are loaded, or anything to watch out for when building a dictionary?
"#8" 中提到了jieba不会支持停用词. 如果要实现停用词的话, 应该如何着手? es-jieba中此功能是实现了lucene的createComponents接口返回StopFilter, 具体去停用词的逻辑应该在lucene内部.
How do I install the jieba Java version on Windows 7? Thanks.
I have quite a few questions; please bear with me:
1. Forced segmentation
For example, similar to ansj segmentation, specify an ambiguity.dic:
有限公司 有限 a 公司 n
to force "有限公司" to be split into "有限" and "公司".
Or something like the dictionary-adjustment usage in Python jieba:
>>> print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
如果/放到/post/中将/出错/。
>>> jieba.suggest_freq(('中', '将'), True)
494
>>> print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
如果/放到/post/中/将/出错/。
Is there a method like this?