lianjia's People

Contributors

wenpeiyu, xjkj123

lianjia's Issues

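Improving HoleCityDown() to avoid fetching too much data

The per-district latitude and longitude arrays are initialized only once, when HoleCityDown is first called. After that, the crawl for each district appends to the arrays left over from the previous district, so the arrays keep growing and the same points are fetched repeatedly.

Fix:
lat = []
lng = []
should be moved inside
for x in area_list: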
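
A minimal sketch of the proposed change. The body of HoleCityDown is not shown in this issue, so `collect_points`, `grid_points`, and the dictionary keys below are hypothetical stand-ins; only the placement of `lat = []` and `lng = []` inside the loop reflects the suggestion.

```python
def grid_points(lo, hi, step):
    """Hypothetical helper: evenly spaced values from lo up to hi."""
    n = int(round((hi - lo) / step))
    return [round(lo + i * step, 6) for i in range(n + 1)]


def collect_points(area_list, step=0.01):
    """Return one (name, lat, lng) triple per district."""
    per_area = []
    for x in area_list:
        # Reset per district so the previous district's points do not accumulate.
        lat = []
        lng = []
        lat.extend(grid_points(x["min_lat"], x["max_lat"], step))
        lng.extend(grid_points(x["min_lng"], x["max_lng"], step))
        per_area.append((x["name"], lat, lng))
    return per_area


# Made-up bounds, used only to show that the second district starts empty.
areas = [
    {"name": "A", "min_lat": 30.00, "max_lat": 30.02, "min_lng": 114.00, "max_lng": 114.01},
    {"name": "B", "min_lat": 30.10, "max_lat": 30.11, "min_lng": 114.20, "max_lng": 114.22},
]
result = collect_points(areas)
```

If the two assignments stay above the loop, district B's arrays would also contain district A's points, which is the growth described in this issue.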

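How to deal with the crawl getting stuck

The problem is shown in the screenshot below:

image

Every time the crawl reaches step two it gets stuck somewhere. In the screenshot it is stuck at 70% of this district, and it does not move again even if left running for a whole day.

The only option is to restart the program and crawl from scratch, but then it gets stuck somewhere else. I have been running step two for three days and it still has not finished.

How can this be solved? How can I resume from where it got stuck last time instead of having to start over?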
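
One possible way to resume, sketched under assumptions: the issue does not say whether the library offers a resume option, so `run_with_checkpoint`, `fetch_one`, and the checkpoint file are hypothetical names. The idea is to record each finished task on disk and skip it on the next run. This only helps once a stall surfaces as an exception, so the underlying HTTP call would also need a timeout.

```python
import json
import os
import tempfile


def run_with_checkpoint(tasks, fetch_one, checkpoint_path):
    """Process tasks in order, skipping those already recorded as done."""
    done = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "r", encoding="utf-8") as f:
            done = set(json.load(f))

    results = []
    for task in tasks:
        key = str(task)
        if key in done:
            continue  # finished in an earlier run
        results.append(fetch_one(task))  # may raise; earlier progress is on disk
        done.add(key)
        with open(checkpoint_path, "w", encoding="utf-8") as f:
            json.dump(sorted(done), f)
    return results


# Demo with a stand-in fetch function that fails once on the third task.
path = os.path.join(tempfile.mkdtemp(), "progress.json")
calls = []
state = {"fail": True}


def fake_fetch(task):
    if task == 3 and state["fail"]:
        state["fail"] = False
        raise RuntimeError("simulated stall")
    calls.append(task)
    return task * 10


try:
    run_with_checkpoint([1, 2, 3, 4], fake_fetch, path)
except RuntimeError:
    pass  # the first run stops at task 3

second_run = run_with_checkpoint([1, 2, 3, 4], fake_fetch, path)
```

The second run skips tasks 1 and 2 and continues from task 3. In a real crawl the tasks could be the grid points or community ids of step two.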

The crawl is interrupted partway through; how should this be handled?

100%|███████████████████████████████████████| 2460/2460 [08:58<00:00, 4.88it/s]
100%|███████████████████████████████████████| 2580/2580 [09:10<00:00, 4.39it/s]
57%|██████████████████████▍ | 2139/3721 [07:40<05:45, 4.58it/s]
100%|███████████████████████████████████████| 3721/3721 [13:28<00:00, 4.82it/s]
7%|██▍ | 1504/22010 [05:22<1:12:00, 4.75it/s]
100%|███████████████████████████████████| 22010/22010 [1:22:34<00:00, 4.49it/s]
0%| | 0/8116 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "test.py", line 7, in <module>
    lj.GetCompleteHousingInfo(city)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/Lianjia/lianjia.py", line 505, in GetCompleteHousingInfo
    ret = Lianjia(city).GetHousingInfo(x[0], x[1])
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/Lianjia/lianjia.py", line 386, in GetHousingInfo
    for x in house_json['data']['ershoufang_info']['list']:
TypeError: 'NoneType' object is not subscriptable
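
The traceback shows that one level of `house_json['data']['ershoufang_info']['list']` was `None`. A minimal sketch of a guard, assuming the response has already been decoded into a dict: `extract_listings` is a hypothetical helper, not part of the library, and the `id` field in the demo is made up.

```python
def extract_listings(house_json):
    """Return the listing list, or [] when any level is missing or None."""
    data = (house_json or {}).get("data") or {}
    info = data.get("ershoufang_info") or {}
    return info.get("list") or []


# A successful response (the "id" field is made up) and a failed one.
ok = {"data": {"ershoufang_info": {"list": [{"id": 1}]}}}
failed = {"errno": 10001, "error": "invalid request", "data": None}

ok_listings = extract_listings(ok)
failed_listings = extract_listings(failed)  # empty list, no TypeError
```

With such a guard the loop would skip a failed request instead of aborting the whole run. The skipped item should still be logged so that it can be retried.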

GetCompleteHousingInfo() raises an error

When I run lianjia.py directly, the first two steps work, but the third step, GetCompleteHousingInfo(), fails immediately with this error:
Traceback (most recent call last):
  File "lianjia.py", line 351, in <module>
    GetCompleteHousingInfo(city)  # fetch detailed info for houses on sale
  File "lianjia.py", line 332, in GetCompleteHousingInfo
    ret = Lianjia(city).GetHousingInfo(x[0], x[1])
  File "lianjia.py", line 214, in GetHousingInfo
    for x in house_json['data']['ershoufang_info']['list']:
TypeError: 'NoneType' object is not subscriptable

If I switch to calling the third step through the pip package instead:
import Lianjia.lianjia as lj
city='上海'
lj.GetCompleteHousingInfo(city)
then it continues and the error above does not occur.

Printing ret.text on the line before the failing code gives:
jQuery111106822012072868358_1534402288206({"request_id":"1914311585","uniq_id":"D41F-612E-203B-50C1-CA3355E268DD","errno":10001,"error":"invalid request","data":null})
It looks like the GET request failed?

The behavior is the same on both Linux and Windows, and switching networks does not help.
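
The printed `ret.text` is a JSONP wrapper around a body whose `data` field is `null`, which explains the `TypeError`. A minimal sketch of unwrapping such a response and failing with the server's own error message: `parse_jsonp`, `check_response`, and `LianjiaRequestError` are hypothetical names, and treating `data: null` as the failure signal is an assumption based only on this one response.

```python
import json
import re

# Matches callbackName( ... ) with an optional trailing semicolon.
JSONP_RE = re.compile(r"^\s*[\w$.]+\((.*)\)\s*;?\s*$", re.DOTALL)


class LianjiaRequestError(Exception):
    """Raised when the server answers without a data payload."""


def parse_jsonp(text):
    """Strip a JSONP callback wrapper if present, then decode the JSON body."""
    match = JSONP_RE.match(text)
    payload = match.group(1) if match else text
    return json.loads(payload)


def check_response(text):
    """Return the data payload, or raise with the server's errno and message."""
    body = parse_jsonp(text)
    if body.get("data") is None:
        raise LianjiaRequestError(
            "errno=%s error=%s" % (body.get("errno"), body.get("error"))
        )
    return body["data"]


# The response text quoted in this issue.
sample = (
    'jQuery111106822012072868358_1534402288206({"request_id":"1914311585",'
    '"uniq_id":"D41F-612E-203B-50C1-CA3355E268DD","errno":10001,'
    '"error":"invalid request","data":null})'
)

try:
    check_response(sample)
    message = None
except LianjiaRequestError as exc:
    message = str(exc)
```

This does not explain why the server rejects the request, but it turns the failure into an explicit "invalid request" error instead of a `TypeError` further down.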

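Fewer records crawled than expected

Hello, and thank you for sharing your code. When running it, I found that the number of records in detailinfo does not match the sum of the count values in lianjia_area. For Beijing, for example, I got 6180 records in detailinfo, while the count values in lianjia_area add up to 80347. The run was not interrupted. Is this normal?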

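The whole data acquisition process has flaws

First, Lianjia now bans IPs. Second, with the logic used here, the data you end up with has serious problems. Take Wuhan as an example:
  1. The first step needs Wuhan's latitude/longitude and area code to fetch the district-level information under Wuhan, such as Hankou and Wuchang. These follow national standards and are easy to obtain, so there is no problem here.
  2. In the second step you take the district-level information and use each district's boundary attributes to send requests in a loop, in something like a point-sampling approach, to fetch the residential communities in that district. This is where the problem appears: Lianjia returns the communities within some range of the given coordinate, including ones that are not in the district. Lianjia's backend may also shard its tables and databases, so community ids can repeat. As a result, different records overwrite each other or the same record is duplicated.
  3. A problem shared by fetching community information and fetching house information is that the format of the JSON string returned by the server is not fixed: the part that carries the data is sometimes an object and sometimes an array, so a single uniform access method loses data.
    I rewrote the project in Java based on your code and solved the problems above, but IP banning and some performance issues remain unsolved.
    My project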
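
Points 2 and 3 above can be sketched as follows. This is not the Java rewrite mentioned in the issue, only an illustration under stated assumptions: an object-shaped payload is assumed to map keys to records, and the field names `id` and `name` are made up. Because ids alone may collide, the sketch deduplicates on a composite key.

```python
def as_records(node):
    """Normalize a JSON node that may be an object or an array into a list."""
    if node is None:
        return []
    if isinstance(node, dict):
        # Assumption: an object payload maps some key to each record.
        return list(node.values())
    if isinstance(node, list):
        return node
    raise TypeError("unexpected payload type: %s" % type(node).__name__)


def dedupe(records, key_fields=("id", "name")):
    """Keep the first record for each composite key, preserving order."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(rec.get(field) for field in key_fields)
        if key in seen:
            continue
        seen.add(key)
        unique.append(rec)
    return unique


# The same kind of data arriving once as an array and once as an object.
from_array = [{"id": "1", "name": "A"}, {"id": "2", "name": "B"}]
from_object = {"1": {"id": "1", "name": "A"}, "9": {"id": "1", "name": "C"}}

merged = dedupe(as_records(from_array) + as_records(from_object))
names = [rec["name"] for rec in merged]
```

Here the duplicate of community A is dropped, while community C survives even though it shares id "1" with A. Which fields uniquely identify a community in the real data would need to be checked against actual responses.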

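Request: support for Zhengzhou

Zhengzhou is not among the currently supported cities. Is there any chance it could be supported?

Thank you very much.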

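Have other cities been added yet?

I ran your code and think it is very nice. Thank you.
I would like to crawl data for the whole country, but city_dict currently has parameters for only 5 cities. If you have not updated it yet, could you explain how the four parameters (the maximum and minimum latitude and longitude) were obtained? So far I have only found city_id. If you have updated the code, could you share it? Thanks again!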
