
jieba-analysis's People

Contributors

bluemapleman, cctyl, ender503, linkerlin, menghan, piaolingxue, sharkdoodoo, tokikanno, weakish, yarson


jieba-analysis's Issues

Problem with a custom dictionary

With the author's program, when user.dict is not loaded, "鲜芋仙" in my text is split into "鲜芋" and "仙". I then added the line "鲜芋仙 3" to user.dict and loaded it in my program. The console shows that the custom dictionary was loaded, but the segmentation result did not change. Is my custom dictionary written incorrectly? Thanks.

Java jieba: adding a custom dictionary

    // The dictionary is a process-wide singleton; loadUserDict adds entries to it.
    WordDictionary dictAdd = WordDictionary.getInstance();

    File file = new File("D:/jieba-analysis/conf/user.dict");
    dictAdd.loadUserDict(file);
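
In newer jieba-analysis releases, loadUserDict takes a java.nio.file.Path rather than a File; a minimal sketch, assuming such a build:

    import com.huaban.analysis.jieba.WordDictionary;
    import java.nio.file.Paths;

    public class LoadUserDictDemo {
        public static void main(String[] args) {
            // Path-based overload found in newer releases.
            WordDictionary.getInstance().loadUserDict(
                    Paths.get("D:/jieba-analysis/conf/user.dict"));
        }
    }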

Is there a way to force how particular words are segmented?

I have quite a few questions, apologies:

1. Forced segmentation.
For example, similar to the ansj segmenter, where you specify an ambiguity.dic entry such as

    有限公司 有限 a 公司 n

to force "有限公司" to be split into "有限" and "公司".

Or, similar to the dictionary-adjustment usage in the Python jieba:

    >>> print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
    如果/放到/post/中将/出错/。
    >>> jieba.suggest_freq(('中', '将'), True)
    494
    >>> print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
    如果/放到/post/中/将/出错/。

Is there a method like this?

2. Separately, I'd like to ask: is there a way to have two segmenters at the same time, one using the user dictionary and one using the default dictionary? (See the sketch below.)
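
On question 2: in jieba-analysis the dictionary lives in a process-wide singleton (WordDictionary.getInstance()), which every JiebaSegmenter consults, so two truly independent segmenters do not appear to be possible without modifying the library. A minimal sketch of the limitation (the class names are the real jieba-analysis ones; the shared-state behaviour is my reading of the singleton design):

    import com.huaban.analysis.jieba.JiebaSegmenter;
    import com.huaban.analysis.jieba.WordDictionary;
    import java.io.File;

    public class TwoSegmenters {
        public static void main(String[] args) {
            JiebaSegmenter defaultSeg = new JiebaSegmenter();
            JiebaSegmenter userSeg = new JiebaSegmenter();
            // Both instances read the same WordDictionary singleton, so
            // loading a user dict here changes the output of BOTH.
            WordDictionary.getInstance().loadUserDict(new File("conf/user.dict"));
        }
    }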

Error during build

When building, I get the following error:
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-gpg-plugin:1.4:sign (sign-artifacts) on project jieba-analysis: Exit code: 2 -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
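
The failure comes from the maven-gpg-plugin trying to sign artifacts, which requires a configured GPG key. If you only need a local build, a common workaround is to skip signing via the plugin's standard gpg.skip property:

    mvn clean install -Dgpg.skip=true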

Part-of-speech tagging

Why did the rolled-back version drop part-of-speech tagging?

Using part-of-speech tags

After segmentation, how do I get the part of speech of each word?

jython can't run jieba

    >>> import jieba
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/mengbin/jython2.7.0/Lib/site-packages/jieba/__init__.py", line 15, in <module>
        from ._compat import *
    ImportError: No module named _compat


On the segmenter automatically converting uppercase letters to lowercase

Why does the program need to convert uppercase letters to lowercase first to get better recognition accuracy? I commented out the line `char ch = CharacterUtil.regularize(paragraph.charAt(i));` and found that accuracy drops as soon as it is disabled, and I don't understand why. For example, with the line enabled "C++" is recognized directly as "c++"; with it disabled it becomes "C", "++". The correct result should obviously be "C++".
Automatically lowercasing also causes problems in the processing that happens after segmentation.
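
A workaround for keeping the original casing (a sketch; it relies on SegToken's public word/startOffset/endOffset fields, which jieba-analysis does expose): segment as usual, then slice the original input by each token's offsets.

    import com.huaban.analysis.jieba.JiebaSegmenter;
    import com.huaban.analysis.jieba.JiebaSegmenter.SegMode;
    import com.huaban.analysis.jieba.SegToken;
    import java.util.List;

    public class OriginalCaseDemo {
        public static void main(String[] args) {
            String text = "我会写C++程序";
            JiebaSegmenter segmenter = new JiebaSegmenter();
            List<SegToken> tokens = segmenter.process(text, SegMode.SEARCH);
            for (SegToken t : tokens) {
                // t.word may be lowercased internally; the offsets still point
                // into the original string, so substring recovers the casing.
                String original = text.substring(t.startOffset, t.endOffset);
                System.out.println(t.word + " -> " + original);
            }
        }
    }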

ValueError when using a custom dictionary

    homedir = os.getcwd()
    jieba.load_userdict(os.path.join(homedir, "words.txt"))

Running it produces the following error:

  File "extract_tags.py", line 14, in <module>
    jieba.load_userdict(os.path.join(homedir, "words.txt"))
  File "/Library/Python/2.7/site-packages/jieba/__init__.py", line 381, in load_userdict
    f.name, lineno, line))
ValueError
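
For reference, load_userdict expects one entry per line in the form `word [freq [tag]]` (frequency and tag optional, whitespace-separated, UTF-8 encoded), and the ValueError at `__init__.py` line 381 is raised for any line that fails to parse; a stray BOM or a non-numeric frequency field are common causes. A sketch of a well-formed words.txt, with entries borrowed from jieba's own userdict example:

    云计算 5
    李小福 2 nr
    凱特琳 nz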

After loadUserDict with my own dictionary, SegMode.INDEX results are wrong

Before loading my own dictionary with WordDictionary.getInstance().loadUserDict, five occurrences of "解除合同" could be cut out; after loading my dictionary, only two. Are there any rules for loading a dictionary, or anything to watch out for when building one?

Error when a custom dictionary entry contains a space

Exception in thread "main" java.lang.NumberFormatException: For input string: "server"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1241)
at java.lang.Double.valueOf(Double.java:504)
at com.huaban.analysis.jieba.WordDictionary.loadUserDict(WordDictionary.java:151)
at com.huaban.analysis.jieba.WordDictionary.loadUserDict(WordDictionary.java:134)

The entry that triggers the error is "sql server".
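
The stack trace itself suggests the cause (my reading, not confirmed against the source): loadUserDict splits each line on whitespace and parses the second field as the frequency, so for "sql server" the token "server" lands in the frequency column and Double.valueOf fails. Dictionary entries therefore presumably cannot contain spaces; a valid line is a single word plus an optional frequency and tag, e.g.:

    sqlserver 3 nz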

About part-of-speech tagging

From the README it seems part-of-speech tagging was removed for efficiency reasons. Which of the Java releases, if any, include POS tagging, or was the feature never added?

Slow model loading on Android

Hello, I'd like to use jieba in an Android project, but loading the model is very slow, usually 15 to 20 seconds. Is there any way to make it load faster?

A question about route

Hi, I'm reading your source code and something puzzles me. In this method:

    private Map<Integer, Pair<Integer>> calc(String sentence, Map<Integer, List<Integer>> dag)

there is the line:

    double freq = wordDict.getFreq(sentence.substring(i, x + 1)) + route.get(x + 1).freq;

Why is it written this way? The frequencies of adjacent words are added together, and from back to front, so the frequencies of later words get carried all the way to the first word. Is there a theoretical basis for this?
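
For what it's worth, the usual reading of this code, which follows the same scheme as the Python jieba (my interpretation, not an authoritative statement about this port): getFreq returns a log-scaled probability rather than a raw count, so adding values corresponds to multiplying word probabilities under a unigram model, and the backward loop is dynamic programming over the segmentation DAG:

    route(i) = \max_{x \in \mathrm{DAG}(i)} \big( \log P(w_{i \ldots x}) + route(x+1) \big), \qquad route(n) = 0

route(i) is then the score of the best segmentation of the suffix starting at position i, which is why values appear to flow from the end of the sentence back to the first word.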

About the model file prob_emit.txt

How was this model file trained? What corpus was used, and is the corresponding training code available? I'd like to retrain the model, because my segmenter is used in the e-commerce domain.

On implementing stop words

"#8" 中提到了jieba不会支持停用词. 如果要实现停用词的话, 应该如何着手? es-jieba中此功能是实现了lucene的createComponents接口返回StopFilter, 具体去停用词的逻辑应该在lucene内部.

Error when using jieba-analysis 1.0.2 with Solr 5.5.2!

@piaolingxue Hello: I downloaded your jieba-analysis together with analyzer-solr (https://github.com/sing1ee/analyzer-solr.git) and am using them in Solr 5.5.2 with the following configuration:
    <fieldType name="text_jieba" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory" segMode="SEARCH"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory" segMode="SEARCH"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
      </analyzer>
    </fieldType>

The error is:
2016/8/11 下午3:10:33 ERROR null HttpSolrCall null:java.lang.RuntimeException: java.lang.NoSuchFieldError: word
null:java.lang.RuntimeException: java.lang.NoSuchFieldError: word
at org.apache.solr.servlet.HttpSolrCall.sendError(HttpSolrCall.java:607)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:475)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:257)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:208)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:192)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:165)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:198)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:108)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:140)
.................
Caused by: java.lang.NoSuchFieldError: word
at analyzer.solr5.jieba.JiebaTokenizer.incrementToken(JiebaTokenizer.java:38)
at org.apache.solr.handler.AnalysisRequestHandlerBase.analyzeTokenStream(AnalysisRequestHandlerBase.java:188)
at org.apache.solr.handler.AnalysisRequestHandlerBase.analyzeValue(AnalysisRequestHandlerBase.java:127)
at org.apache.solr.handler.FieldAnalysisRequestHandler.analyzeValues(FieldAnalysisRequestHandler.java:220)
at org.apache.solr.handler.FieldAnalysisRequestHandler.handleAnalysisRequest(FieldAnalysisRequestHandler.java:181)
.............

Where did I go wrong in the configuration, and how can I fix it? Thank you!
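
(A note on the error itself rather than the configuration: java.lang.NoSuchFieldError usually means a class was compiled against a different version of a dependency than the one on the classpath. Here JiebaTokenizer appears to have been built against a jieba-analysis release whose token class exposed a public word field that the 1.0.2 jar no longer matches; rebuilding analyzer-solr against the exact jieba-analysis jar in use would be the first thing to try.)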

Words like "UTF-8" aren't recognized?

I added a custom dictionary as follows:

    UTF-8 3 nz
    utf-8 3 nz

The log shows it loaded successfully:

    user dict C:\Users\pc\Desktop\eclipse\workspace\neo4jlearn\conf\user.dict load finished, tot words:2, time elapsed:1ms

But the word is still split into three tokens: "utf", "-" and "8".

As far as I can tell, the C++ version handles this.
Also, is English converted to lowercase?

How can dates be supported?

Thanks for huaban's Java version, it works well! But how can dates be supported, e.g. 2014-08-28?

Hello, I'm configuring the jieba tokenizer in Solr. Is there a ready-to-use jar?

Hello,
1. I'm configuring the jieba tokenizer in the Solr engine. Is there a jar that can be used directly?
2. Also, by default jieba does not further segment strings that mix letters and digits; I'd like those split as well. How should I modify the source?
For example, for SourceString = "C49D47618" I would like the segmentation to produce "C", "49", "D", "47618", "C49D47618". (A post-processing alternative is sketched below.)
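
One alternative that avoids touching the segmenter source (a sketch; AlnumSplitter is a made-up helper, not jieba API): post-split any mixed letter/digit token with a regex and keep the full token alongside the runs.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class AlnumSplitter {
        // Matches maximal runs of letters or of digits.
        private static final Pattern RUNS = Pattern.compile("[A-Za-z]+|[0-9]+");

        // "C49D47618" -> [C, 49, D, 47618, C49D47618]
        public static List<String> split(String token) {
            List<String> parts = new ArrayList<>();
            Matcher m = RUNS.matcher(token);
            while (m.find()) parts.add(m.group());
            if (parts.size() > 1) parts.add(token); // keep the full token too
            return parts;
        }
    }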

Chinese + space + English

Hi, when I use sentenceProcess, I get:

    Exception in thread "main" java.lang.NullPointerException
        at com.huaban.analysis.jieba.JiebaSegmenter.createDAG(JiebaSegmenter.java:28)
        at com.huaban.analysis.jieba.JiebaSegmenter.sentenceProcess(JiebaSegmenter.java:158)

from this code:

    JiebaSegmenter segmenter = new JiebaSegmenter();
    System.out.println(segmenter.sentenceProcess("黑 X").toString());

But using process instead:

    System.out.println(segmenter.process("黑 X", SegMode.INDEX).toString());

does not throw this exception.

Thanks
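
(A likely explanation, not confirmed against the source: process first splits a paragraph into runs of word and non-word characters and only then feeds each run to sentenceProcess, whereas sentenceProcess assumes its input is already one such pre-split chunk, so the space in "黑 X" leaves a hole in the DAG and triggers the NullPointerException. process is the supported entry point for raw text.)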

Compilation problem

import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

cannot be resolved
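
java.nio.file was introduced in Java 7, so these imports failing to resolve usually means the project is being compiled with a source level below 1.7. A sketch of the standard Maven fix, using the stock maven-compiler-plugin properties:

    <properties>
      <maven.compiler.source>1.7</maven.compiler.source>
      <maven.compiler.target>1.7</maven.compiler.target>
    </properties>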

cannot resolve symbol 'junit'

I get "cannot resolve symbol 'junit'", and re-importing in Maven does not help. How can I fix this? The IDE is IntelliJ.

How to import into Eclipse

I want to load this project in Eclipse, but only a plain folder is loaded, not a project, and nothing can be run.
Is it missing the .classpath and .project files?
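
Two standard options for a Maven project like this one: import it via File > Import > Existing Maven Projects (needs the m2e plugin), or generate the missing Eclipse metadata with the maven-eclipse-plugin:

    mvn eclipse:eclipse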

Problems with highlighting in Solr when using the Jieba analyzer

Hi,
I'm using the Jieba analyzer to index Chinese characters in Solr. Segmentation works fine when using the Analysis page on the Solr Admin UI.

However, when I try highlighting in Solr, it does not highlight in the correct place. For example, when I search for 自然环境与企业本身, it highlights 认<em>为自然环</em><em>境</em><em>与企</em><em>业本</em>身的

Even when I search for the English word responsibility, it highlights <em> responsibilit</em>y.

Basically the highlighting is consistently off by one character/space.
Does anyone know what the issue could be?

I'm using jieba-analysis-1.0.0, Solr 5.2.1 and Lucene 5.1.0

Regards,
Edwin
