webdict's Introduction

webdict

###2014-05-11 update###

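webdict has been updated. New-word discovery was run over a 1 GB Guokr corpus, a 750 MB Douban corpus, a 2 GB Tencent News corpus, a 2.5 GB Tencent Finance corpus, and a 500 MB Tencent Technology corpus, adding 19,739 words.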

webdict_with_freq.txt now contains 220,934 entries.

tagger.txt now contains 41,685 annotations.

P.S. Writing crawlers to scrape web pages is nothing but tears QAQ

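###2014-02-22 update###

The first webdict dictionary is released.

webdict.txt is the dictionary without word frequencies, with 201,195 entries in total.

webdict_with_freq.txt is the dictionary with word frequencies, with 154,967 entries in total. The corpus used to count the frequencies contains 1,583,096,137 words.

Both dictionaries make use of the CC-CEDICT word list (CC BY-SA 3.0 license).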

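###2014-01-05 update###

As of today, webdict.info has collected 28,923 word annotations, and the annotation results have been merged into tagged.txt.

New-word discovery is about to switch from the Twitter corpus to a mixed corpus of news and Twitter text.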

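###2013-09-16 update###

webdict.info has collected 18,849 word annotations, of which 6,346 were judged to be words. These have been merged into webdict.txt.

Added the word annotation file tagged.txt.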

webdict's People

Contributors

ling0322

webdict's Issues

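Suggestion: use a list instead of yes/no

People have inertia when making choices: if a long run of candidates are all non-words, a real word may be missed. Perhaps show a list of several candidate words at once and let the user pick which ones are words, moving on to the next group after submitting. This makes wrong choices less likely, and a mistake can still be corrected as long as the group has not been submitted.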

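About years

The dictionary contains many entries of the form "such-and-such year" (某某年). They are words, but should they really be added to the dictionary? If so, how large would the dictionary have to be? Similar cases are 某某日 (day) and 某某元 (yuan). A simple rule could handle this: combinations of a numeral with a measure word or a time unit are not added to the dictionary unless they have a special meaning.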
一九一一年
一九一七年
一九一九年
一九一八年
一九一四年
一九七一年
一九七七年
一九七三年
一九七九年
一九七二年
一九七五年
一九七八年
一九七六年
一九七四年
一九三一年
一九三七年
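
The rule proposed above could be sketched roughly as follows. This is only an illustration, not the project's actual filter: the numeral set, the unit set, the whitelist `KEEP`, and the function name `filter_entries` are all assumptions chosen for the example.

```python
import re

# Assumption: illustrative character sets, not taken from webdict itself.
CHINESE_NUMERALS = "零〇一二三四五六七八九十百千万亿两"
UNITS = "年月日元"

# One or more numerals (Chinese or ASCII digits) followed by a single unit.
NUMERAL_UNIT = re.compile(rf"[{CHINESE_NUMERALS}0-9]+[{UNITS}]")

# Hypothetical whitelist of numeral+unit words that carry a special meaning.
KEEP = {"百年", "千年", "万年"}


def filter_entries(words, keep=KEEP):
    """Drop pure numeral+unit entries unless they are whitelisted."""
    return [w for w in words if w in keep or not NUMERAL_UNIT.fullmatch(w)]


if __name__ == "__main__":
    sample = ["一九一一年", "词典", "百年", "五十元", "新年"]
    print(filter_entries(sample))
```

Using `fullmatch` means that only entries consisting entirely of numerals plus one unit are dropped, so words such as 新年 that merely end in a unit character are kept.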
