fishtn / hoopa
A lightweight, fast, asynchronous distributed crawler framework
License: Apache License 2.0
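If you don't believe me and think your style of wrapping is simpler to use, let's compare on crawling the Douban Top 250 movies at https://movie.douban.com/top250 . Scheduling from the list page to the detail pages is a very representative example. Let's see whose version is simpler and whose takes fewer lines of code.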
from function_scheduling_distributed_framework import task_deco, BrokerEnum
import requests
from parsel import Selector


@task_deco('douban_list_page_task_queue', broker_kind=BrokerEnum.PERSISTQUEUE, qps=0.1)  # qps freely sets exactly how many times per second to crawl, far better than typical frameworks that can only fix the number of concurrent threads.
def craw_list_page(page):
    """Douban list page: collect the movie links on the page."""
    url = f'https://movie.douban.com/top250?start={page * 25}&filter='
    resp = requests.get(url, headers={'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)'})
    sel = Selector(resp.text)
    for li_item in sel.xpath('//*[@id="content"]/div/div[1]/ol/li'):
        movie_name = li_item.xpath('./div/div[2]/div[1]/a/span[1]/text()').extract_first()
        movie_detail_url = li_item.xpath('./div/div[2]/div[1]/a/@href').extract_first()
        # Push each detail page onto the detail queue.
        craw_detail_page.push(movie_detail_url, movie_name)


@task_deco('douban_detail_page_task_queue', broker_kind=BrokerEnum.PERSISTQUEUE, qps=0.2)
def craw_detail_page(detail_url, movie_name):
    """Douban detail page: get the movie's plot description."""
    resp = requests.get(detail_url, headers={'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)'})
    sel = Selector(resp.text)
    # extract_first() returns None when the node is missing, so fall back to ''.
    description = (sel.xpath('//*[@id="link-report"]/span[1]/text()[1]').extract_first() or '').strip()
    print('Saving to database:', movie_name, detail_url, description)


if __name__ == '__main__':
    # craw_list_page(0)
    # craw_detail_page('https://movie.douban.com/subject/6786002/', '触不可及')
    for p in range(10):
        craw_list_page.push(p)
    craw_list_page.consume()
    craw_detail_page.consume()
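To make the qps idea concrete, here is a minimal sketch of pacing calls by a target rate rather than by a thread count. It is not the implementation used by function_scheduling_distributed_framework; the class name `QpsPacer` and its `clock` and `sleep` parameters are made up for illustration.

```python
import time
from typing import Callable, Optional


class QpsPacer:
    """Spaces out calls so that at most `qps` of them start per second.

    Hypothetical helper for illustration only, not part of any framework.
    """

    def __init__(self, qps: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        if qps <= 0:
            raise ValueError("qps must be positive")
        self.interval = 1.0 / qps  # seconds between consecutive starts
        self._clock = clock
        self._sleep = sleep
        self._next_start: Optional[float] = None  # earliest allowed start time

    def wait(self) -> float:
        """Block until the next call may start and return the seconds slept."""
        now = self._clock()
        if self._next_start is None or now >= self._next_start:
            # First call, or we are already past the slot: start immediately.
            delay = 0.0
            self._next_start = now + self.interval
        else:
            # Too early: sleep until the reserved slot, then reserve the next one.
            delay = self._next_start - now
            self._next_start += self.interval
        if delay > 0:
            self._sleep(delay)
        return delay
```

With `qps=0.1`, as in the list-page task, calling `wait()` before each request would leave about ten seconds between request starts, however many worker threads exist.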
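If someone has already written their functions with requests, this framework is not compatible with them. Why insist on copying scrapy, inheriting from a crawler base class and then imitating scrapy's hook functions? I can't make sense of it...
My framework implements distributed queues, concurrency, and consumption acknowledgment. Adding a request client on top would be enough to maintain it as a crawler framework; the client is there to make switching between different proxy IPs a one-step operation. Besides that, the concurrent running speed of your current wrapping approach is certainly lower than mine. If you don't believe it, install nginx on your local machine with no backend interface proxied behind it, send requests directly to nginx on port 80, and let's compare the running speed.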
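The comparison proposed above could be measured with something like the following sketch. It is only a rough harness under stated assumptions: `measure_throughput` and `fetch_local_nginx` are hypothetical names, the URL assumes nginx is serving its default page on 127.0.0.1 port 80, and the plain thread pool stands in for whichever framework's consume loop is being timed.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def fetch_local_nginx(url: str = "http://127.0.0.1:80/") -> int:
    """Request the local nginx page (assumed URL) and return the HTTP status."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.status


def measure_throughput(fetch: Callable[[], object], total: int, workers: int) -> float:
    """Run `fetch` `total` times on `workers` threads; return calls per second."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch) for _ in range(total)]
        for future in futures:
            future.result()  # re-raise any request error instead of hiding it
    elapsed = time.perf_counter() - started
    return total / elapsed if elapsed > 0 else float("inf")
```

Calling `measure_throughput(fetch_local_nginx, total=10000, workers=50)` with nginx running would give one baseline figure. For a fair comparison, each framework should run the same `fetch` function with the same request count.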