qunar's Introduction

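Python 3 Web Crawler Development in Practice (Python3 网络爬虫开发实战)

This book explains how to develop web crawlers with Python 3. It first walks through environment setup and crawler fundamentals in detail. It then covers request libraries such as urllib and requests, parsing libraries such as Beautiful Soup, XPath, and pyquery, and storage in text files and various databases. Next, it uses several case studies to show how to crawl Ajax data and how to crawl dynamic websites with Selenium and Splash. It goes on to cover crawling techniques: crawling through proxies and maintaining a dynamic proxy pool, using ADSL dial-up proxies, cracking image, GeeTest (极验), TouClick (点触), grid (宫格), and other kinds of CAPTCHAs, crawling sites behind a simulated login, and maintaining a Cookies pool. In addition, the book takes the characteristics of the mobile internet into account and explores crawling apps with tools such as Charles, mitmdump, and Appium. It then introduces the pyspider and Scrapy frameworks and distributed crawling, and closes with Bloom Filter efficiency optimization, crawler deployment with Docker and Scrapyd, and crawler management with Gerapy.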

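This book is published and distributed by Turing Education (图灵教育) and Posts & Telecom Press (人民邮电出版社). All rights reserved; reproduction is prohibited.

Author: Cui Qingcai (崔庆才)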

Purchase link:

Join the reader group:

Video resources:

Python3 Crawler: Three Hands-On Case Studies

Do It Yourself! Python3 Web Crawler Hands-On Case Studies

qunar's People

Contributors

germey

qunar's Issues

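Crawling travel notes shows FETCH_ERROR

This page shows FETCH_ERROR.

[image]

However, when I click into one of the crawl results, it turns out the fetch succeeded.

[image]

Also, the crawl stops after the first page and does not continue. How can I fix these problems?

[image]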

HTTP 599 error after adding fetch_type='js'

After adding fetch_type='js', some responses fail and others succeed. The error output is:
[E 210127 14:28:16 base_handler:203] HTTP 599: Empty reply from server
    Traceback (most recent call last):
      File "/home/dbrandom/anaconda3/lib/python3.8/site-packages/pyspider/libs/base_handler.py", line 196, in run_task
        result = self._run_task(task, response)
      File "/home/dbrandom/anaconda3/lib/python3.8/site-packages/pyspider/libs/base_handler.py", line 175, in _run_task
        response.raise_for_status()
      File "/home/dbrandom/anaconda3/lib/python3.8/site-packages/pyspider/libs/response.py", line 172, in raise_for_status
        six.reraise(Exception, Exception(self.error), Traceback.from_string(self.traceback).as_traceback())
      File "/home/dbrandom/anaconda3/lib/python3.8/site-packages/six.py", line 702, in reraise
        raise value.with_traceback(tb)
      File "/home/dbrandom/anaconda3/lib/python3.8/site-packages/pyspider/fetcher/tornado_fetcher.py", line 499, in phantomjs_fetch
        response = yield gen.maybe_future(self.http_client.fetch(request))
      File "/home/dbrandom/anaconda3/lib/python3.8/site-packages/tornado/httpclient.py", line 101, in fetch
        response = self._io_loop.run_sync(functools.partial(
      File "/home/dbrandom/anaconda3/lib/python3.8/site-packages/tornado/ioloop.py", line 458, in run_sync
        return future_cell[0].result()
      File "/home/dbrandom/anaconda3/lib/python3.8/site-packages/tornado/concurrent.py", line 238, in result
        raise_exc_info(self._exc_info)
      File "", line 4, in raise_exc_info
    Exception: HTTP 599: Empty reply from server
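
Since the report says only some responses fail, it may help to measure how often the 599 error occurs before changing anything. The following is a minimal sketch, not part of pyspider, that tallies HTTP error codes from log lines. It assumes every error line follows the `[LEVEL YYMMDD HH:MM:SS module:lineno] message` layout seen in the first line of the report; the names `parse_log_line` and `count_http_errors` are hypothetical helpers introduced here for illustration. For context, 599 is the code Tornado's HTTP client reports when no HTTP response was received at all, so it points at the connection to the fetcher rather than at the target site's status.

```python
import re
from collections import Counter

# Assumed layout, taken from the single log line in the report:
# [E 210127 14:28:16 base_handler:203] HTTP 599: Empty reply from server
LOG_LINE = re.compile(
    r"^\[(?P<level>[A-Z]) (?P<date>\d{6}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<module>\w+):(?P<lineno>\d+)\] (?P<message>.*)$"
)
HTTP_ERROR = re.compile(r"^HTTP (?P<code>\d{3}): (?P<reason>.*)$")


def parse_log_line(line):
    """Return a dict of fields for a matching log line, else None."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None  # traceback frames and other lines are skipped
    fields = match.groupdict()
    http = HTTP_ERROR.match(fields["message"])
    fields["http_code"] = int(http.group("code")) if http else None
    fields["http_reason"] = http.group("reason") if http else None
    return fields


def count_http_errors(lines):
    """Tally HTTP error codes across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        fields = parse_log_line(line)
        if fields and fields["http_code"] is not None:
            counts[fields["http_code"]] += 1
    return counts
```

Feeding the captured log output to `count_http_errors` gives a per-code count, which can be compared against the number of tasks that completed to see whether the failures are occasional or systematic.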
