
Chen Ruoyi's Projects

text-prepocessing

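Text preprocessing covers tokenization, part-of-speech (POS) tagging, lemmatization (grouping the inflected forms of a word under its base form), and stop-word removal. This project additionally needs to remove Chinese characters and Chinese punctuation. First, a txt file is read in and converted to a string, and an empty array is created to hold the words. English tokenization is fairly simple: splitting on spaces is enough. The English words are then reduced to their base forms. After comparing the Lancaster Stemmer, the Porter Stemmer, and the WordNet Lemmatizer, we found that lemmatization restores the original word better, so we chose lemmatization. Because the WordNetLemmatizer module takes both the word and its part of speech as input, the words must be POS-tagged first. Once we have a two-dimensional array of lemmas and their POS tags, stop words are removed. Since our goal is to analyze IELTS reading vocabulary, and the dirty txt data produced by OCR may contain Chinese (for example, notes on the first and last pages, or the preface), we also remove Chinese characters and Chinese punctuation.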
