asynccrawl's Introduction

A Coroutine-Based Asynchronous Crawler in Python

Introduction

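The core of this project comes from the book 500 Lines or Less. The original authors are A. Jesse Jiryu Davis, an engineer at MongoDB, and Guido van Rossum, the creator of Python. The project code is released under the MIT license.

Traditional computer science puts much of its effort into finding more efficient algorithms. For most networked programs today, however, the time cost lies mainly in maintaining many socket connections, or in an event loop that does not handle them efficiently enough.
For these programs, the challenge is how to wait efficiently on a large number of network events and schedule them concurrently. The approach that is currently popular is asynchronous I/O.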
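
To make "waiting on many network events" concrete, here is a minimal sketch using Python's standard selectors module. It is not taken from the project files: the function name wait_for_ready is hypothetical, and local socket pairs stand in for real network connections so the example runs without network access.

```python
import selectors
import socket


def wait_for_ready(n_connections=3, senders=(0, 2)):
    """Wait on several sockets at once and read only from those that are ready.

    Local socket pairs are an assumption standing in for remote connections.
    """
    sel = selectors.DefaultSelector()
    pairs = [socket.socketpair() for _ in range(n_connections)]
    try:
        # Register the reading end of every connection with one selector.
        for index, (reader, _writer) in enumerate(pairs):
            reader.setblocking(False)
            sel.register(reader, selectors.EVENT_READ, data=index)

        # Only some of the "servers" answer.
        for index in senders:
            pairs[index][1].send(b"page %d" % index)

        received = {}
        while len(received) < len(senders):
            # A single call waits on all registered sockets together.
            events = sel.select(timeout=1)
            if not events:
                break  # nothing became readable within the timeout
            for key, _mask in events:
                received[key.data] = key.fileobj.recv(1024)
        return received
    finally:
        sel.close()
        for reader, writer in pairs:
            reader.close()
            writer.close()


if __name__ == "__main__":
    print(wait_for_ready())
```

One thread watches every connection here, and no time is spent blocked on a socket that has nothing to deliver.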

Program List

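  • thready.py: a concurrent crawler with a custom thread pool
  • thready2.py: a concurrent crawler with the built-in ("batteries included") thread pool
  • callback.py: a crawler on a non-blocking event loop
  • coroutine.py: a coroutine crawler on an event loop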

Thread Pools, Callbacks, Coroutines

We want to speed up page fetching by running the crawler concurrently. There are generally three ways to do this:

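  • Thread pool: start a pool of threads. Each time a new link is discovered, it is put on a task queue. A thread in the pool takes a link from the queue, opens a socket, fetches the page, parses it, and puts any new links back on the queue.
  • Callbacks: the program has a main loop called the event loop, which keeps receiving events. Registering and unregistering callback functions on those events produces the effect of running many tasks concurrently. The drawback is that once a task needs many callbacks, the code becomes scattered and hard to maintain.
  • Coroutines: these also run on an event loop, but rely on generators. A generator can pause partway through and resume later, so callbacks that would otherwise have to be written as separate functions can be written inside a single generator.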
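
The coroutine approach can be sketched as follows. This is a minimal illustration and not the code in coroutine.py: the Future and Task classes, the crawl_site function, and the in-memory FAKE_WEB link graph are all assumptions introduced here, and a simple queue of pending futures stands in for sockets waiting on an event loop.

```python
from collections import deque


class Future:
    """A result that is not available yet; callbacks run once it is set."""

    def __init__(self):
        self.result = None
        self._callbacks = []

    def add_done_callback(self, fn):
        self._callbacks.append(fn)

    def set_result(self, result):
        self.result = result
        for fn in self._callbacks:
            fn(self)


class Task:
    """Drives a generator, resuming it whenever the Future it yielded resolves."""

    def __init__(self, gen):
        self.gen = gen
        self.step(None)  # run the generator up to its first yield

    def step(self, future):
        value = future.result if future is not None else None
        try:
            next_future = self.gen.send(value)
        except StopIteration:
            return  # the coroutine finished
        next_future.add_done_callback(self.step)


# Hypothetical link graph used instead of real HTTP requests.
FAKE_WEB = {"/": ["/a", "/b"], "/a": ["/b"], "/b": []}


def crawl_site(root, web):
    pending = deque()  # (future, url) pairs, standing in for sockets in flight
    seen = {root}
    order = []

    def fetch(url):
        # Simulated non-blocking fetch: the loop below resolves the future.
        future = Future()
        pending.append((future, url))
        return future

    def crawl(url):
        # Fetching and handling the response sit in one function;
        # the generator pauses at the yield until the links arrive.
        links = yield fetch(url)
        order.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                Task(crawl(link))

    Task(crawl(root))
    # Toy event loop: complete pending fetches one at a time.
    while pending:
        future, url = pending.popleft()
        future.set_result(web.get(url, []))
    return order


if __name__ == "__main__":
    print(crawl_site("/", FAKE_WEB))
```

In a callback version, the code after the yield would have to live in a separate function registered on the event. Here the generator keeps its local state across the pause, so the logic reads top to bottom.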

asynccrawl's People

Contributors

evanmu96
