
Comments (8)

shahuwang commented on August 20, 2024

fmt.Println(sego.SegmentsToString(segments, false))
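Change the false in this line to true and it works.
The problem I ran into was the dictionary path: github.com/huichen/sego/data/dictionary.txt
The dictionary cannot be loaded that way; I had to change it to the full path.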

from sego.

huichen commented on August 20, 2024

@brandyaptx

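That happens because “中华人民共和国**人民政府” is a single word in the dictionary. If you do not want it to appear, you can delete it from the dictionary, or, as @shahuwang said, use the true option. With that option “中华人民共和国**人民政府” is segmented into “中华人民共和国 / **人民政府”, and these two words can be split further. See the comment here: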

https://github.com/huichen/sego/blob/master/token.go#L44

@shahuwang

Yes. The example in README.md comes from https://github.com/huichen/sego/blob/master/tools/example.go , and it needs the full path to be specified.


brandyaptx commented on August 20, 2024

@huichen
I used the true option, and the result is:
中华/nz 人民/n 共和/nz 共和国/ns 人民共和国/nt 中华人民共和国/ns **/n 人民/n 政府/n 人民政府/nt **人民政府/nt 中华人民共和国**人民政府/nt
Is there a way to get only the finest-grained segmentation directly?


huichen commented on August 20, 2024

@brandyaptx

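The segmentation result is actually a tree (see the link I gave above), and what the true option prints is the nodes visited by a depth-first traversal. If you need the finest segmentation, just keep only the leaf nodes during the depth-first traversal.

Also, what is your actual application? For search, all of the nodes may be useful.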


brandyaptx commented on August 20, 2024

I want to try out word2vec, and I am looking for a handy segmentation tool for that.


huichen commented on August 20, 2024

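This word2vec looks like a lot of fun. Where does your Chinese training data come from? Wiki?

I think that for word2vec it may be fine to keep “中华人民共和国**人民政府” as a single word: the word is not common, so it probably has little effect on the model.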


brandyaptx commented on August 20, 2024

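@huichen Sorry, I have been busy with other things for the past few days and did not notice the email. My Chinese training set is the news from the news site I work for, tens of thousands of documents. The word2vec results seem decent, but apart from mining synonyms to improve the quality of on-site search, I am not sure what other applications there might be.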


huichen commented on August 20, 2024

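@brandyaptx I have not actually used word2vec. Perhaps it can be treated like a topic model: use word2vec to generate input features for other machine learning systems, for example to cluster news articles or to recommend news based on a user's reading history.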

