telegramTextDealer

A Telegram scraper aimed at Chinese-language groups, with text cleaning and jieba word segmentation.

Completed: 2020/08

01 Scraping the specified Telegram groups

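The scraper itself comes from Kosat's repo. Following his instructions, it works once the telegram-messages-dump package is installed. (Prerequisite: you need your own Telegram account and must be able to receive the verification code.)
However, his tool can only scrape one Telegram group per command-line invocation. We therefore extended it so that a group.txt file can specify a list of groups.
You can get started like this:
1. Install the package: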

pip install telegram-messages-dump

2. Scrape all groups listed in group.txt:

python whichGroup.py

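Notes:
In our tests, connection problems could still occur even with a proxy running on the local machine. We recommend running the script on a server located abroad. The scraping settings can be changed in whichGroup.py; at present it is set to scrape every message since the group was created: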

nowcmd="telegram-messages-dump -c@"+item+" -p +8612345678901 -l 0 -o result"+str(nowid)+".log"
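The rest of whichGroup.py is not reproduced here. As a rough sketch of how such a script might read group.txt and issue one command per group, the following assumes one group name per line and output files numbered from 0; the helper names read_groups, build_command and dump_all are hypothetical, and the phone number is the placeholder from the line above.

```python
import subprocess
from pathlib import Path
from typing import List


def read_groups(path: str = "group.txt") -> List[str]:
    """Return the non-empty group names, one per line (assumed file format)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Accept names written with or without a leading '@'.
    return [ln.strip().lstrip("@") for ln in lines if ln.strip()]


def build_command(group: str, index: int, phone: str) -> List[str]:
    """Build the telegram-messages-dump call for one group.

    Only flags listed in the tool's help text are used:
    -c chat, -p phone, -l limit (0 = no limit), -o output file.
    """
    return [
        "telegram-messages-dump",
        "-c", "@" + group,
        "-p", phone,
        "-l", "0",
        "-o", f"result{index}.log",
    ]


def dump_all(groups_file: str = "group.txt", phone: str = "+8612345678901") -> None:
    """Run one dump per group, in order; a failed group does not stop the rest."""
    for index, group in enumerate(read_groups(groups_file)):
        subprocess.run(build_command(group, index, phone), check=False)
```

Calling dump_all() from the directory that holds group.txt would then produce result0.log, result1.log and so on. Passing the command as a list avoids shell quoting issues with unusual group names.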

The optional parameters and related information are copied from the README of Kosat's repo for reference:

telegram-messages-dump -c <chat_name> -p <phone_num> [-l <count>] [-o <file>] [-cl]

Where:
    -c,  --chat     Unique name of a channel/chat. E.g. @python.
    -p,  --phone    Phone number. E.g. +380503211234.
    -o,  --out      Output file name or full path. (Default: telegram_<chatName>.log)
    -e,  --exp      Exporter name. text | jsonl | csv (Default: 'text')
      ,  --continue Continue previous dump. Supports optional integer param <message_id>.
    -l,  --limit    Number of the latest messages to dump, 0 means no limit. (Default: 100)
    -cl, --clean    Clean session sensitive data (e.g. auth token) on exit. (Default: False)
    -v,  --verbose  Verbose mode. (Default: False)
      ,  --addbom   Add BOM to the beginning of the output file. (Default: False)
    -h,  --help     Show this help message and exit.

At this point the group logs have been scraped. Put them in the telegram folder. In this example, the telegram folder already contains result0.log, result1.log and result2.log.

02 Text cleaning

First create two empty folders by hand: tempTexts and cleanTexts.
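If you would rather not create the folders by hand, a few lines of Python can do it. This is only a convenience sketch; ensure_dirs is a hypothetical helper and is not part of the repo.

```python
from pathlib import Path
from typing import List


def ensure_dirs(base: str = ".") -> List[Path]:
    """Create tempTexts and cleanTexts under base if they do not exist yet."""
    created = []
    for name in ("tempTexts", "cleanTexts"):
        folder = Path(base) / name
        # exist_ok=True makes the call safe to repeat.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```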

python text-cleaning.py

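This reads the files in telegram, removes special symbols, punctuation, digits, Latin letters and spaces, converts Traditional Chinese to Simplified, and stores the resulting pure Simplified Chinese text in the tempTexts folder, with names such as result0_temp.log.
It then removes empty lines and stores the result in the cleanTexts folder, with names such as result0_clean.log.
In the example provided, both folders are currently empty; run the script to produce the processed text.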
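The two cleaning stages described above can be sketched as follows. This is not the code in text-cleaning.py: it assumes that "pure Chinese" means characters in the basic CJK Unified Ideographs range, and it leaves the Traditional-to-Simplified conversion as a pluggable function because the source does not say which converter the script uses. The names clean_line and clean_lines are hypothetical.

```python
import re
from typing import Callable, Iterable, List, Tuple

# Everything outside U+4E00..U+9FFF is dropped: symbols, punctuation,
# digits, Latin letters, whitespace and line endings.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]")


def identity(text: str) -> str:
    """Placeholder converter; swap in a real Traditional-to-Simplified function."""
    return text


def clean_line(line: str, to_simplified: Callable[[str], str] = identity) -> str:
    """Keep only Chinese characters, then apply the converter."""
    return to_simplified(NON_CHINESE.sub("", line))


def clean_lines(
    lines: Iterable[str], to_simplified: Callable[[str], str] = identity
) -> Tuple[List[str], List[str]]:
    """Return (temp, clean): temp keeps emptied lines, clean drops them."""
    temp = [clean_line(line, to_simplified) for line in lines]
    clean = [line for line in temp if line]
    return temp, clean
```

Here temp corresponds to what would be written to tempTexts and clean to what would be written to cleanTexts. If a converter such as OpenCC is installed, its conversion function could be passed as to_simplified.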

03 Jieba segmentation and stop-word removal:

python jiebacut.py

This reads files such as result0_clean.log from cleanTexts, processes them, and overwrites them with the result.
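A minimal sketch of the segmentation and stop-word step, assuming the jieba package is installed and that stop words come as one word per line. The contents of jiebacut.py are not shown here, so load_stopwords and cut_and_filter are hypothetical names, and the cut parameter exists only so the filter can be tried with another tokenizer.

```python
from typing import Callable, Iterable, List, Optional, Set


def load_stopwords(lines: Iterable[str]) -> Set[str]:
    """Build a stop-word set from lines holding one word each (assumed format)."""
    return {line.strip() for line in lines if line.strip()}


def cut_and_filter(
    line: str,
    stopwords: Set[str],
    cut: Optional[Callable[[str], List[str]]] = None,
) -> List[str]:
    """Segment one line and drop stop words and whitespace-only tokens."""
    if cut is None:
        import jieba  # third-party; imported lazily so the filter works without it

        cut = jieba.lcut  # returns the segmentation as a list of strings
    return [tok for tok in cut(line) if tok.strip() and tok not in stopwords]
```

Joining the returned tokens with spaces and writing them back to the same path would reproduce the overwrite behaviour described above; the space separator is an assumption.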

Contributors

m1-llie
