
sinaweibo-locationsignin-spider's Introduction

Sina Weibo Check-in Page Spider


1. Features

Crawls, city by city, all posts under each POI on the Sina Weibo mobile site and saves them to corresponding CSV files. The fields collected are:

Field              Meaning
user_id            user ID
user_name          screen name
gender             gender
tweets             post text
textLength         length of the post text
created_at         time of posting
source             client used to post
followers_count    number of followers
follow_count       number of accounts followed
statuses_count     total number of past posts
profile_url        profile page URL
pic_num            number of images
pics_url           image URLs
reposts_count      number of reposts
comments_count     number of comments
attitudes_count    number of likes (attitudes)

2. Files

buildip.py, builds the proxy pool by scraping 西刺高匿代理 (Xici high-anonymity proxies).
myemail.py, sends a notification email to your own mailbox once crawling finishes.
wifi.py, keeps the network connection alive (reconnects automatically after a drop).
crawler.py, the spider itself.
config.ini, the configuration file; its options are the email address, Wi-Fi name, city name, and city code.

3. How It Works

The site crawled is the Sina Weibo mobile site (新浪微博移动端); compared with the PC site, its pages are structurally simpler and less restricted, and the check-in pages do not require a simulated login.

First, crawl the city page (for example, Wuhan's URL is https://m.weibo.cn/p/1001018008642010000000000 ), collect all POIs under that city, and write them to a .csv file.
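
A minimal sketch of this step, assuming the JSON endpoint m.weibo.cn/api/container/getIndex (the /p/ URLs here render the same container as HTML). The response layout used below, data -> cards[0] -> card_group with a scheme and title_sub per entry, matches the crawler excerpt quoted in the issues further down:

    import json
    import re

    import requests

    # Wuhan's check-in container id, taken from the example URL above;
    # the getIndex endpoint is an assumption, not a quote from crawler.py.
    CITY_URL = ('https://m.weibo.cn/api/container/getIndex'
                '?containerid=1001018008642010000000000')

    res = requests.get(CITY_URL + '&page=1',
                       headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
    info = json.loads(res.text)
    if info['ok'] == 1:
        for card in info['data']['cards'][0]['card_group']:
            # POI entries carry a scheme link embedding the POI container id.
            poi_id = re.search(r'100101B2094[A-Z0-9]{15}', card['scheme'])
            if poi_id:
                print(poi_id.group(), card['title_sub'])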

Then read the generated CSV to get each POI's name and id, and build URLs from the id to crawl the posts under that POI. Example URL: https://m.weibo.cn/p/index?containerid=100101B2094655D464A3FF439C
(screenshots in the original README: the 武汉市 (Wuhan) city page and the 黄鹤楼 (Yellow Crane Tower) POI page on mobile Weibo)
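
A sketch of the per-POI step under the same endpoint assumption. The field names mirror the table in section 1 and the usual m.weibo.cn card layout (each post sits under an mblog key), but the exact JSON paths are assumptions, not excerpts from crawler.py:

    import pandas as pd
    import requests

    # Yellow Crane Tower POI container id from the example URL above.
    POI_URL = ('https://m.weibo.cn/api/container/getIndex'
               '?containerid=100101B2094655D464A3FF439C')

    res = requests.get(POI_URL + '&page=1',
                       headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
    rows = []
    for card in res.json()['data']['cards']:
        mblog = card.get('mblog')       # non-post cards lack this key
        if not mblog:
            continue
        user = mblog['user']
        rows.append({
            'user_id': user['id'],
            'user_name': user['screen_name'],
            'gender': user['gender'],
            'tweets': mblog['text'],
            'created_at': mblog['created_at'],
            'source': mblog['source'],
            'followers_count': user['followers_count'],
            'reposts_count': mblog['reposts_count'],
            'comments_count': mblog['comments_count'],
            'attitudes_count': mblog['attitudes_count'],
        })

    # utf-8-sig keeps Chinese text readable when the CSV is opened in Excel.
    pd.DataFrame(rows).to_csv('poi_posts.csv', index=False, encoding='utf-8-sig')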

4. Usage

Edit config.ini: set email_address to your own email address, wifi to the name of a Wi-Fi network the machine has already connected to, cityName to the name of the city to crawl, and cityId to the city code.
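
For reference, a hypothetical config.ini layout; the key names come from the paragraph above, but the section header and example values are assumptions:

    [config]
    email_address = user@example.com
    wifi = MyHomeWifi
    cityName = 武汉
    cityId = 4201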

City codes follow the province/city code table on the Sina Weibo Open Platform. For example, the province code of Hubei is 42 and Wuhan's code within it is 1, so Wuhan's full code is 4201. Note that for the four directly administered municipalities (Beijing, Shanghai, Tianjin, Chongqing) the last two digits are always 0 and there is no further subdivision: Beijing is 1100, and a district code such as 1108 (Haidian District, Beijing) returns no content.
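
The composition rule is simple enough to state in code; a minimal sketch (the helper name is hypothetical):

    # City code = two-digit province code followed by two-digit city code.
    def city_code(province: int, city: int) -> str:
        return f'{province:02d}{city:02d}'

    assert city_code(42, 1) == '4201'   # Hubei (42) + Wuhan (1)
    assert city_code(11, 0) == '1100'   # Beijing: municipality, suffix stays 00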

5. Third-Party Dependencies

  • requests
  • pandas
  • configparser
  • fake_useragent

6. Contact Me

If you have any suggestions, feel free to contact me at [email protected] or open an issue. Stars are welcome!

sinaweibo-locationsignin-spider's Issues

The proxies no longer work

Could you swap in proxies that currently work? Many thanks!

Timeout problem

Hello, how should I resolve the following error when crawling proxies?
----------------IP used for crawling proxies: {'http': '223.241.119.42:47972'} --------------------
Traceback (most recent call last):
File "D:\Program Files\python36\lib\urllib\request.py", line 1318, in do_open
encode_chunked=req.has_header('Transfer-encoding'))
File "D:\Program Files\python36\lib\http\client.py", line 1239, in request
self._send_request(method, url, body, headers, encode_chunked)
File "D:\Program Files\python36\lib\http\client.py", line 1285, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "D:\Program Files\python36\lib\http\client.py", line 1234, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "D:\Program Files\python36\lib\http\client.py", line 1026, in _send_output
self.send(msg)
File "D:\Program Files\python36\lib\http\client.py", line 964, in send
self.connect()
File "D:\Program Files\python36\lib\http\client.py", line 1392, in connect
super().connect()
File "D:\Program Files\python36\lib\http\client.py", line 936, in connect
(self.host,self.port), self.timeout, self.source_address)
File "D:\Program Files\python36\lib\socket.py", line 724, in create_connection
raise err
File "D:\Program Files\python36\lib\socket.py", line 713, in create_connection
sock.connect(sa)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "D:\Program Files\python36\lib\site-packages\fake_useragent\utils.py", line 67, in get
context=context,
File "D:\Program Files\python36\lib\urllib\request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "D:\Program Files\python36\lib\urllib\request.py", line 526, in open
response = self._open(req, data)
File "D:\Program Files\python36\lib\urllib\request.py", line 544, in _open
'_open', req)
File "D:\Program Files\python36\lib\urllib\request.py", line 504, in _call_chain
result = func(*args)
File "D:\Program Files\python36\lib\urllib\request.py", line 1361, in https_open
context=self._context, check_hostname=self._check_hostname)
File "D:\Program Files\python36\lib\urllib\request.py", line 1320, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error timed out>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "E:/Pycharm/weiboqiandao/crawler.py", line 255, in
main()
File "E:/Pycharm/weiboqiandao/crawler.py", line 229, in main
ippool = build_ippool()
File "E:\Pycharm\weiboqiandao\buildip.py", line 82, in build_ippool
results = p.get_proxy(page)
File "E:\Pycharm\weiboqiandao\buildip.py", line 37, in get_proxy
res = requests.get(url, proxies=proxy_ip, headers={'User-Agent': UserAgent(use_cache_server=False).random})
File "D:\Program Files\python36\lib\site-packages\fake_useragent\fake.py", line 69, in init
self.load()
File "D:\Program Files\python36\lib\site-packages\fake_useragent\fake.py", line 78, in load
verify_ssl=self.verify_ssl,
File "D:\Program Files\python36\lib\site-packages\fake_useragent\utils.py", line 250, in load_cached
update(path, use_cache_server=use_cache_server, verify_ssl=verify_ssl)
File "D:\Program Files\python36\lib\site-packages\fake_useragent\utils.py", line 245, in update
write(path, load(use_cache_server=use_cache_server, verify_ssl=verify_ssl))
File "D:\Program Files\python36\lib\site-packages\fake_useragent\utils.py", line 178, in load
raise exc
File "D:\Program Files\python36\lib\site-packages\fake_useragent\utils.py", line 154, in load
for item in get_browsers(verify_ssl=verify_ssl):
File "D:\Program Files\python36\lib\site-packages\fake_useragent\utils.py", line 97, in get_browsers
html = get(settings.BROWSERS_STATS_PAGE, verify_ssl=verify_ssl)
File "D:\Program Files\python36\lib\site-packages\fake_useragent\utils.py", line 84, in get
raise FakeUserAgentError('Maximum amount of retries reached')
fake_useragent.errors.FakeUserAgentError: Maximum amount of retries reached
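
The traceback shows fake_useragent failing to download its browser-statistics data at first use, a known failure mode when its upstream server is unreachable. One workaround sketch (not from this repository) is to bypass fake_useragent with a fixed User-Agent string:

    import requests

    # Any modern browser UA string works here; this one is just an example.
    HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/90.0.4430.93 Safari/537.36'}

    res = requests.get('https://m.weibo.cn', headers=HEADERS, timeout=10)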

Project inquiry

Dear developer, hello!
I have recently been following your project in order to analyze location data, but I hit the following errors while running it:

  1. In buildip.py, inside the get_proxy function, the line
    res = requests.get(url, proxies=proxy_ip, headers={'User-Agent': UserAgent(use_cache_server=False).random})
    raises TypeError: __init__() got an unexpected keyword argument 'use_cache_server'. After removing that unexpected argument, the same line fails again:
  2. res = requests.get(url, proxies=proxy_ip, headers={'User-Agent': UserAgent().random})
    requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host.', None, 10054, None))
    I hope you can find time to fix or update the code. Thank you very much!

Request for a code update

Could you maintain and update this code? Can it still crawl? I tried it and could not crawl anything. Thanks in advance!

POI coverage

The code currently fetches 10 pages of POIs per city. Is there a way to extend that? Thanks!
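
One way to lift the limit, sketched below, is to replace the fixed page range with a loop that stops when the API reports no more data; the info['ok'] != 1 stop condition appears in the crawler excerpt quoted in the next issue. The callback name is hypothetical:

    import json
    import requests

    def crawl_all_poi_pages(cityURL, headers, handle_page):
        # Keep requesting pages until the API signals the end of the POI list
        # (info['ok'] != 1), instead of stopping after a fixed 10 pages.
        page = 1
        while True:
            res = requests.get(cityURL + '&page=' + str(page), headers=headers)
            info = json.loads(res.text)
            if info['ok'] != 1:
                break
            handle_page(info)   # hypothetical per-page handler
            page += 1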

AttributeError: 'NoneType' object has no attribute 'group'

Traceback (most recent call last):
File "C:/Users/XXJ/PycharmProjects/pythonProject1/poicrawler/crawler.py", line 254, in
main()
File "C:/Users/XXJ/PycharmProjects/pythonProject1/poicrawler/crawler.py", line 234, in main
spider.get_poi(ippool)
File "C:/Users/XXJ/PycharmProjects/pythonProject1/poicrawler/crawler.py", line 64, in get_poi
pois_id.append(poi_id.group())
AttributeError: 'NoneType' object has no attribute 'group'

I ran the code and got the error above. Do you know the cause? The error points to the following section:

        res = requests.get(cityURL + '&page=' + str(page), proxies=proxy_ip, headers=headers)
        if res.status_code == 200:
            info = json.loads(res.text)

            if info['ok'] == 1:
                card_group = info['data']['cards'][0]['card_group']
                print(card_group)
                print(len(card_group))
                for i in range(0, len(card_group)):
                    poi_id = re.search(r'100101B2094[A-Z0-9]{15}', card_group[i]['scheme'])

                    pois_id.append(poi_id.group())
                    pois_name.append(card_group[i]['title_sub'])
            else:
                print('All POIs for this city have been crawled.')
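
A hedged fix for this crash: re.search returns None when a card_group entry's scheme does not match the POI pattern (some cards are ads or other types), and calling .group() on None raises exactly this AttributeError. A sketch of the guarded loop, reusing the names from the excerpt above:

    for i in range(0, len(card_group)):
        poi_id = re.search(r'100101B2094[A-Z0-9]{15}', card_group[i]['scheme'])
        if poi_id is None:
            continue   # skip entries whose scheme is not a POI link
        pois_id.append(poi_id.group())
        pois_name.append(card_group[i]['title_sub'])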
