
qqzone_crawler's Introduction

QQzone_crawler

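A crawler for QQ Zone (Qzone) feeds. It logs in with a cookie, fetches the posts of every friend whose space you can access, and saves them locally.

The third-party library requests must be installed first.
The program was written with Python 3.5 on Linux. My machine has both Python 2.7 and Python 3.5 installed, with Python 2 as the default, so every script starts with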

#!/usr/bin/env python3

The program uses from urllib import parse and builds URLs with the parse module, so Python 2 users need to change the corresponding places. The print statements need to be changed as well.
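As a minimal sketch of the kind of URL construction mentioned above, the snippet below builds a query string with `urllib.parse.urlencode`. The base URL and parameter names are placeholders chosen for illustration, not the ones the crawler actually uses.

```python
from urllib import parse  # Python 3; in Python 2 use `from urllib import urlencode`


def build_url(base_url, params):
    """Append URL-encoded query parameters to base_url."""
    return base_url + "?" + parse.urlencode(params)


# Hypothetical endpoint and parameters, for illustration only.
url = build_url("https://example.com/api", {"uin": "10001", "offset": 0, "num": 50})
print(url)
```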

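File descriptions

main.py: the main entry point. Run it with python3 main.py.

get_my_friends.py: fetches files containing your QQ friends' information (yourself included) from the QQ Zone server. Each entry has a QQ number and a name (the remark name you set). The files are saved locally, 50 entries per file, and the program pauses for 5 seconds after each request. At runtime the files are saved automatically into the friends folder.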
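The paging behaviour described above (50 entries per request, a 5-second pause, stop when the list runs out) could look roughly like the sketch below. `fetch_page` and `is_last_page` are hypothetical hooks supplied by the caller, for example a thin wrapper around `requests.get`; they are not the crawler's real functions, and writing each page to the friends folder is left out.

```python
import time


def fetch_all_pages(fetch_page, is_last_page, page_size=50, delay=5, sleep=time.sleep):
    """Collect pages until is_last_page(text) is true, pausing between requests.

    fetch_page(offset) -> str is a hypothetical, caller-supplied hook.
    """
    pages = []
    offset = 0
    while True:
        text = fetch_page(offset)
        if is_last_page(text):
            break
        pages.append((offset, text))
        offset += page_size
        sleep(delay)  # be polite to the server between requests
    return pages


# Offline demonstration with a fake server that holds 120 friends.
def fake_fetch(offset):
    remaining = max(0, 120 - offset)
    return ",".join(str(offset + i) for i in range(min(50, remaining)))


pages = fetch_all_pages(fake_fetch, is_last_page=lambda t: t == "", sleep=lambda s: None)
print([offset for offset, _ in pages])
```

Passing `sleep` as a parameter keeps the delay in real runs while letting the demonstration finish instantly.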

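get_qq_number.py: extracts the QQ number and name of every friend from the files saved in the previous step. Each QQ number and name pair is stored as a dictionary, the dictionaries are collected into a list, and the list is saved locally as qqnumber.inc.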
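A rough sketch of that extraction step follows. It assumes each saved page has already been parsed into a dict with a `uinlist` array whose items carry `uin` and `name` keys; those field names and the JSON layout of qqnumber.inc are assumptions made for illustration and may differ from what the server returns and what the program writes.

```python
import json
import os
import tempfile


def collect_friends(pages):
    """Flatten parsed friend pages into a list of {"uin": ..., "name": ...} dicts."""
    friends = []
    for page in pages:
        for item in page.get("uinlist", []):  # "uinlist", "uin", "name" are assumed keys
            friends.append({"uin": str(item["uin"]), "name": item["name"]})
    return friends


def save_friends(friends, path):
    """Write the list as UTF-8 JSON so that non-ASCII names survive."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(friends, f, ensure_ascii=False)


pages = [
    {"uinlist": [{"uin": 10001, "name": "Alice"}, {"uin": 10002, "name": "Bob"}]},
    {"uinlist": []},  # an empty list marks the end of the friend list
]
friends = collect_friends(pages)
out_path = os.path.join(tempfile.mkdtemp(), "qqnumber.inc")
save_friends(friends, out_path)
print(friends)
```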

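get_moods.py: fetches files containing the posts (shuoshuo) published in each friend's space from the QQ Zone server. Each post includes its publication time, content, location, phone/device information, and so on. The files are saved locally, 20 posts per file, with a 5-second pause after each request. At runtime the files are saved automatically into the mood_result folder.

cookie_file: holds the cookie you get after logging in to QQ Zone. Copy it from the browser and paste it into this file. The function that handles the cookie has code to deal with line breaks, but it is still best to keep the cookie on a single line with no trailing blank lines. Note that this file can hold only one cookie. Its purpose is to make setting the cookie convenient, not to get around anti-crawler measures. If you do not know how to obtain the cookie, see here.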
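The line-break handling mentioned above might be done along these lines. This is a sketch, not the program's own function: `load_cookie` and `build_headers` are hypothetical names, and the cookie value is made up.

```python
import os
import tempfile


def load_cookie(path):
    """Read a cookie string from a file, dropping line breaks and blank lines."""
    with open(path, encoding="utf-8") as f:
        return "".join(line.strip() for line in f)


def build_headers(cookie, user_agent="Mozilla/5.0"):
    """Request headers carrying the cookie; the User-Agent is a placeholder."""
    return {"Cookie": cookie, "User-Agent": user_agent}


# Demonstration with a throwaway file that ends in stray blank lines.
demo_path = os.path.join(tempfile.mkdtemp(), "cookie_file")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("uin=o10001; skey=@abc\n\n")

cookie = load_cookie(demo_path)
headers = build_headers(cookie)
print(headers["Cookie"])
```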


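Visualization

operate_table.py: creates the database table that stores post information. It contains two functions, one to create the table and one to drop it. It must be run separately. To create the table: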

python3 operate_table.py create_table

To drop the table:

python3 operate_table.py drop_table
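The two commands above suggest a small dispatcher keyed on the first command-line argument. The following sketch shows one way to arrange that with the standard `sqlite3` module; the table name and columns are hypothetical, and an in-memory database stands in for the real database file.

```python
import sqlite3

TABLE = "moods"  # hypothetical table name


def create_table(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS " + TABLE + " ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "qq TEXT, created_time INTEGER, content TEXT)"  # hypothetical columns
    )
    conn.commit()


def drop_table(conn):
    conn.execute("DROP TABLE IF EXISTS " + TABLE)
    conn.commit()


COMMANDS = {"create_table": create_table, "drop_table": drop_table}


def run(command, conn):
    """Dispatch a command name, as it would arrive in sys.argv[1]."""
    if command not in COMMANDS:
        raise SystemExit("usage: operate_table.py [create_table|drop_table]")
    COMMANDS[command](conn)


def table_exists(conn):
    row = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' AND name=?", (TABLE,)
    ).fetchone()
    return row is not None


conn = sqlite3.connect(":memory:")  # a real script would open a database file
run("create_table", conn)
created = table_exists(conn)
run("drop_table", conn)
dropped = not table_exists(conn)
print(created, dropped)
```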

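get_moods_detail.py: after get_moods.py has run, the files containing each friend's posts are saved locally. This program extracts the post information from those files and puts it into an SQLite database. It must be run separately, and only after operate_table.py has successfully created the table. When it finishes, a moods.sqlite database file appears in the current directory.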
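To illustrate the extract-and-insert step, here is a minimal sketch that takes already-parsed posts and writes a few fields to SQLite with `executemany`. The table layout and the keys `created_time`, `content` and `source_name` are assumptions for the example, not the program's actual schema.

```python
import sqlite3


def insert_moods(conn, qq, moods):
    """Insert selected fields from parsed posts; returns the number of rows."""
    rows = [
        (qq, int(m["created_time"]), m.get("content", ""), m.get("source_name", ""))
        for m in moods
    ]
    conn.executemany(
        "INSERT INTO moods (qq, created_time, content, source_name) VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # stands in for moods.sqlite
conn.execute(
    "CREATE TABLE moods (qq TEXT, created_time INTEGER, content TEXT, source_name TEXT)"
)
parsed = [
    {"created_time": 1500000000, "content": "first post", "source_name": "iPhone"},
    {"created_time": 1500003600, "content": "second post"},  # missing device info
]
count = insert_moods(conn, "10001", parsed)
earliest = conn.execute(
    "SELECT MIN(created_time) FROM moods WHERE qq = ?", ("10001",)
).fetchone()[0]
print(count, earliest)
```

Using `?` placeholders rather than string formatting keeps post content containing quotes from breaking the statement.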

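get_single_report: a web application for viewing a simple report on a given friend's posts in the browser. It also must be run separately, and it only works after get_moods_detail.py has been run to generate the moods.sqlite database file. Just run index.py in this folder. The libraries flask, pandas, sqlalchemy, jieba and wordcloud must be installed first. Run python3 index.py, then open http://localhost:5000/qqnum=<QQ number> in a browser to see the result.

get_word.py: generates the word cloud. The background image is mask.jpg, the QQ Zone five-pointed star. Change the ttf font file path to suit your own system. This program does not need to be run separately.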

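Notes

  1. Friend information is obtained indirectly. You must first set the access permission of your own space in QQ Zone to "QQ friends only"; only then can the program run properly.

  2. The posts fetched for each friend are saved as files in a folder named after that friend's QQ number (these folders sit inside the mood_result folder). They are the files returned by the QQ Zone server, and you still need to process them yourself to get the information you want. The content format is in fact very close to JSON.
     2.1 In the updated version, you can run operate_table.py and then get_moods_detail.py to save the posts into an SQLite database file.

  3. In get_moods_detail.py I only extracted the fields I needed at the time, not everything related to a post. If you need other fields, modify the table-creation function in operate_table.py and the extraction function in get_moods_detail.py.

  4. Once the program starts, it produces a log file, crawler_log.log, which records essential information about the run, such as when each QQ number's space was crawled and whether that space could be accessed.

  5. After the database table has been created, if you need to re-run the extraction and insert the posts again, it is best to drop the existing table first and then run the extraction.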
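Note 2 says the saved files are very close to JSON. If the difference is a JSONP-style callback wrapper around the payload (an assumption, as is the `_Callback` name and the `msglist` key used below), a helper like this sketch would turn the text into ordinary Python objects.

```python
import json


def parse_jsonp(text):
    """Parse plain JSON, or JSON wrapped as `callback({...});`."""
    text = text.strip()
    if text.startswith(("{", "[")):
        return json.loads(text)  # already plain JSON
    start = text.index("(")   # first "(" opens the callback call
    end = text.rindex(")")    # last ")" closes it
    return json.loads(text[start + 1:end])


# Hypothetical server response; parentheses inside strings are left untouched.
raw = '_Callback({"msglist": [{"content": "hello (world)"}]});'
data = parse_jsonp(raw)
print(data["msglist"][0]["content"])
```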

qqzone_crawler's People

Contributors

lucky-zwx, xjr7670


qqzone_crawler's Issues

friends

The fetched friend list files are 0 KB

Error when running index.py

 * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
 * Restarting with stat
 * Debugger is active!
 * Debugger PIN: 317-819-167
127.0.0.1 - - [30/Jan/2018 15:13:38] "GET /qqnum=983267850 HTTP/1.1" 500 -
Traceback (most recent call last):
  File "/home/moonstar/anaconda3/lib/python3.6/site-packages/flask/app.py", line 1997, in __call__
    return self.wsgi_app(environ, start_response)
  File "/home/moonstar/anaconda3/lib/python3.6/site-packages/flask/app.py", line 1985, in wsgi_app
    response = self.handle_exception(e)
  File "/home/moonstar/anaconda3/lib/python3.6/site-packages/flask/app.py", line 1540, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/home/moonstar/anaconda3/lib/python3.6/site-packages/flask/_compat.py", line 33, in reraise
    raise value
  File "/home/moonstar/anaconda3/lib/python3.6/site-packages/flask/app.py", line 1982, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/moonstar/anaconda3/lib/python3.6/site-packages/flask/app.py", line 1614, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/moonstar/anaconda3/lib/python3.6/site-packages/flask/app.py", line 1517, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/home/moonstar/anaconda3/lib/python3.6/site-packages/flask/_compat.py", line 33, in reraise
    raise value
  File "/home/moonstar/anaconda3/lib/python3.6/site-packages/flask/app.py", line 1612, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/moonstar/anaconda3/lib/python3.6/site-packages/flask/app.py", line 1598, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/home/moonstar/github/QQzone_crawler/get_single_report/index.py", line 66, in index
    total_info['first_mood'] = make_date(first_mood)
  File "/home/moonstar/github/QQzone_crawler/get_single_report/util.py", line 26, in make_date
    res = time.gmtime(utstime)
ValueError: Invalid value NaN (not a number)
127.0.0.1 - - [30/Jan/2018 15:13:38] "GET /qqnum=983267850?debugger=yes&cmd=resource&f=style.css HTTP/1.1" 200 -
127.0.0.1 - - [30/Jan/2018 15:13:38] "GET /qqnum=983267850?debugger=yes&cmd=resource&f=jquery.js HTTP/1.1" 200 -
127.0.0.1 - - [30/Jan/2018 15:13:38] "GET /qqnum=983267850?debugger=yes&cmd=resource&f=debugger.js HTTP/1.1" 200 -
127.0.0.1 - - [30/Jan/2018 15:13:39] "GET /qqnum=983267850?debugger=yes&cmd=resource&f=console.png HTTP/1.1" 200 -
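The traceback ends in `time.gmtime` receiving NaN. One plausible cause, which the report does not confirm, is that the database holds no posts for the requested QQ number, so the earliest timestamp comes back as a missing value. A defensive variant of a date helper is sketched below; it reuses the name `make_date` from the traceback, but the body and the output format are assumptions rather than the project's code.

```python
import math
import time


def make_date(timestamp, fmt="%Y-%m-%d %H:%M:%S"):
    """Format a Unix timestamp in UTC, or return None for missing values."""
    if timestamp is None:
        return None
    try:
        value = float(timestamp)
    except (TypeError, ValueError):
        return None
    if math.isnan(value) or math.isinf(value):
        return None  # e.g. the minimum of an empty column
    return time.strftime(fmt, time.gmtime(value))


print(make_date(float("nan")))
print(make_date(86400))
```

The caller can then show a "no posts found" message when the result is `None` instead of letting the request fail with a 500.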

selenium ... TimeoutException: Message: timeout

When I try to run QQZone.py, I always get the error below:

  File "QQZone.py", line 340, in <module>
    capture_data()
  File "QQZone.py", line 333, in capture_data
    sp.login()
  File "QQZone.py", line 62, in login
    self.web.get('https://user.qzone.qq.com/{}'.format(self.__username))
  File "/home/jerry/py3_virtual_env/lib/python3.5/site-packages/selenium/webdriver/remote/webdriver.py", line 324, in get
    self.execute(Command.GET, {'url': url})
  File "/home/jerry/py3_virtual_env/lib/python3.5/site-packages/selenium/webdriver/remote/webdriver.py", line 312, in execute
    self.error_handler.check_response(response)
  File "/home/jerry/py3_virtual_env/lib/python3.5/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message: timeout
  (Session info: chrome=65.0.3325.181)
  (Driver info: chromedriver=2.37.544315 (730aa6a5fdba159ac9f4c1e8cbc59bf1b5ce12b7),platform=Linux 4.13.0-37-generic x86_64)

How can I fix it? Could the problem be my network speed when crawling the page?

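Cannot crawl all content

I tried your code, but the crawler only gets a small part of the data and cannot fetch all of a friend's historical posts. How can I solve this?

Word cloud analysis

Over the next few days I will add word cloud analysis for you, plus an analysis of how posting times fluctuate.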

Response [403]

I ran into the anti-crawler mechanism. Is there a good way to handle it?

import requests
import util

def __init__(self):

    self.headers = util.headers
    self.base_url = util.parse_friends_url()
    util.check_path('friends')
    print('Start fetching the friend list and saving the files to the friends folder')

def get_friends(self):

    key = True
    position = 0
    while key:
        url = self.base_url + '&offset=' + str(position)
        referer = 'http://qzs.qq.com/qzone/v8/pages/setting/visit_v8.html'
        self.headers['Referer'] = referer

        print("\tDealing with position\t%d." % position)
        res = requests.get(url, headers=self.headers)
        print(url)
        html = res.text
        print(res)
        with open('friends/offset' + str(position) + '.json', 'w', encoding='utf-8') as f:
            f.write(html)

        # Check whether the friend list is finished:
        # if so, uinlist is an empty list
        with open('friends/offset' + str(position) + '.json', encoding='utf-8') as f2:
            con = f2.read()

I printed res and it shows 403.

Error

Hello, I get the following error when running the program. What is the cause?

UnicodeEncodeError: 'gbk' codec can't encode character '\U0001f31f' in position 587: illegal multibyte sequence
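This error typically appears on Windows systems whose default encoding is GBK, when text containing an emoji is written to a file or printed without an explicit encoding. The sketch below reproduces the failure and shows two common workarounds; where exactly the crawler hits it is not stated in the report, so treat this as a general illustration.

```python
import os
import tempfile

text = "mood \U0001f31f"  # U+1F31F cannot be represented in GBK

# Reproduce the failure in isolation.
try:
    text.encode("gbk")
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False

# Workaround 1: always pass encoding="utf-8" when opening files for writing.
path = os.path.join(tempfile.mkdtemp(), "mood.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)
with open(path, encoding="utf-8") as f:
    round_trip = f.read()

# Workaround 2: for console output, replace characters the codec cannot encode.
safe = text.encode("gbk", errors="replace").decode("gbk")

print(gbk_ok, round_trip == text, safe)
```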

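Specifying crawl restrictions

Hello, I would like to add a time limit when crawling, restricted to the period from the start of 2017 until just before the 2018 Spring Festival, and then make a word cloud. Where should I modify the code?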
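One way to approach this request is to filter posts by timestamp after they are parsed, before building the word cloud. The sketch below assumes each post carries a Unix timestamp under a key named `created_time` (a hypothetical field name) and uses 16 February 2018, the date of that year's Spring Festival, as the exclusive upper bound.

```python
from datetime import datetime, timedelta, timezone

CST = timezone(timedelta(hours=8))  # China Standard Time (UTC+8)
START = datetime(2017, 1, 1, tzinfo=CST)
END = datetime(2018, 2, 16, tzinfo=CST)  # Spring Festival 2018 fell on 16 February


def in_window(created_time, start=START, end=END):
    """True if a Unix timestamp (seconds) falls in [start, end)."""
    moment = datetime.fromtimestamp(created_time, tz=CST)
    return start <= moment < end


def ts(year, month, day):
    """Demo helper: midnight CST on the given date as a Unix timestamp."""
    return int(datetime(year, month, day, tzinfo=CST).timestamp())


moods = [
    {"created_time": ts(2016, 12, 31), "content": "too early"},
    {"created_time": ts(2017, 6, 1), "content": "kept"},
    {"created_time": ts(2018, 2, 15), "content": "kept as well"},
    {"created_time": ts(2018, 2, 16), "content": "too late"},
]
selected = [m["content"] for m in moods if in_window(m["created_time"])]
print(selected)
```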

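Friend files are 0 KB

Permissions are set to friends only and the cookie is set correctly, but the offset files are always 0 KB. I don't know how to fix this.

Error at runtime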

line 90, in parse_friends_url
if qqnumber[0] == 0:
IndexError: string index out of range
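The `IndexError` means `qqnumber` was an empty string when it was indexed. If the number is taken from the cookie (an assumption; the report does not show how it is obtained), an empty or malformed cookie_file would produce exactly this. The sketch below shows a more defensive extraction with a clearer error message; the `uin=o0123456` cookie layout and the function name are hypothetical.

```python
import re


def extract_qq_number(cookie):
    """Pull the QQ number out of a `uin=o0123456` style cookie field (assumed layout)."""
    match = re.search(r"(?:^|;\s*)uin=o?0*(\d+)", cookie)
    if match is None:
        raise ValueError("no uin field found in the cookie; check cookie_file")
    return match.group(1)


print(extract_qq_number("pgv_pvid=1; uin=o0123456; skey=@x"))

try:
    extract_qq_number("")  # an empty cookie file
except ValueError as exc:
    message = str(exc)
    print(message)
```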
