wchmei / hive

This project forked from dangsh/hive

Lots of spiders

Python 100.00%

hive's Introduction

hive (虫巢, "insect nest")

Overview

This is a personal project I use to practice writing web crawlers in Python:
1. For my own practice
2. As a reference for others

Requirements

python 3.6.2

The following are optional:
scrapy
requests
bs4
pandas
MongoDB
MySQL 5.7

Deployment steps

Install python 3.6.2
Install scrapy, requests, bs4, pymysql, and other modules with pip
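
Before running a spider it can help to confirm which of the optional modules are actually installed. The following is a minimal sketch using only the standard library; the function name `check_modules` and the module list are illustrative assumptions, not part of the repository.

```python
from importlib.util import find_spec

# Top-level module names taken from the requirements above (pymysql from the pip step).
OPTIONAL_MODULES = ["scrapy", "requests", "bs4", "pandas", "pymysql"]


def check_modules(names):
    """Return {module_name: True/False}, True when the module can be located."""
    # find_spec returns None for a top-level module that is not installed.
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, ok in check_modules(OPTIONAL_MODULES).items():
        print(f"{name}: {'installed' if ok else 'missing'}")
```

Any module reported as missing can then be installed with pip before the corresponding spider is run.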

Directory structure

├── beautiSoupSpider // bs4 spiders
├── scrapySpider // scrapy spiders
├── newSpider // other features built with requests
└── README.md // documentation

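V0.0.1 release contents: beautiSoupSpider

qidainTop500
Crawls the top 500 books on the Qidian monthly ticket ranking: title, author, genre, synopsis, latest update, and book URL
doubanTop250
Crawls the top 250 films on the Douban movie chart: title, director, rating, number of ratings, and so on. A first version is complete
doubanBook
Crawls the top 250 books on the Douban book chart: title, author, rating, number of ratings, and so on. A first version is complete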
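
The ranking spiders above all pull a few fields out of a list page. The sketch below illustrates that extraction step. It deliberately uses the standard-library `html.parser` instead of bs4 so that it runs without third-party packages, and the markup in `SAMPLE` is hypothetical: it is not the real structure of any Douban or Qidian page.

```python
from html.parser import HTMLParser


class RankingParser(HTMLParser):
    """Collect the text of every <span class="title"> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "title" in classes:
            self._in_title = True
            self.titles.append("")

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._in_title:
            self._in_title = False
            self.titles[-1] = self.titles[-1].strip()


def extract_titles(html):
    """Return the title strings found in an HTML fragment, in document order."""
    parser = RankingParser()
    parser.feed(html)
    parser.close()
    return parser.titles


# Made-up sample page used only for illustration.
SAMPLE = """
<ol>
  <li><span class="title">First Film</span><span class="rating">9.1</span></li>
  <li><span class="title"> Second Film </span><span class="rating">8.7</span></li>
</ol>
"""

if __name__ == "__main__":
    print(extract_titles(SAMPLE))  # ['First Film', 'Second Film']
```

With bs4 the same idea is usually expressed with a selector query; the class-based parser here only stands in for that step.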
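lagou
So far this only crawls Python job listings. It is being reworked to cover every job type, and I still need to learn how to deal with anti-crawler measures
lagouPlusP
An upgrade of lagou that crawls listings for all job types, not only Python jobs. It is a large crawler,
but it still needs optimization for speed, because upwards of ten thousand records take too long. Later I will learn distributed crawling, proxy pools, and similar techniques to speed it up
The file lagouPlusP.xlsx stores the results of a simple test; gongzuo.xlsx stores the URLs and names of all (315) job types
5.1 After the improvements:
In theory it should be able to crawl about 15,000 job postings from the whole site
A simple test crawled more than 3,000 records, which are stored in lagouPlusp.xlsx
Further improvements to this crawler are planned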
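
The step from lagou to lagouPlusP amounts to looping over a list of job-type URLs and over the result pages of each one. A minimal sketch of that loop follows. The function `crawl_categories`, its parameters, and the `fake_fetch` stand-in are assumptions made for illustration; the real spider's request and parsing code is not shown in this README.

```python
import time


def crawl_categories(categories, fetch_page, max_pages=3, delay=0.0):
    """Crawl every (name, url) category page by page and return all rows.

    fetch_page(url, page) is supplied by the caller and must return a list of
    dicts; an empty list means there are no more pages for that category.
    """
    rows = []
    for name, url in categories:
        for page in range(1, max_pages + 1):
            try:
                page_rows = fetch_page(url, page)
            except Exception as exc:
                # Give up on this category but keep crawling the others.
                print(f"skip {name} page {page}: {exc}")
                break
            if not page_rows:
                break
            rows.extend({"category": name, **row} for row in page_rows)
            time.sleep(delay)  # pause between requests to reduce load on the site
    return rows


def fake_fetch(url, page):
    # Stand-in for a real HTTP request plus parsing: two pages, one row each.
    return [{"title": f"{url}#job{page}"}] if page <= 2 else []


if __name__ == "__main__":
    cats = [("python", "https://example.com/python"),
            ("java", "https://example.com/java")]
    result = crawl_categories(cats, fake_fetch, max_pages=5)
    print(len(result))  # 4
```

Passing the fetch function in as an argument keeps the loop testable without network access; a proxy pool or a distributed queue could later replace `fetch_page` without changing the loop.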

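V0.0.2 release contents: scrapySpider

First
Rewrote the Lagou crawler with the scrapy framework, which greatly improved performance: about 4,000 records in roughly ten seconds
It also includes simple anti-anti-crawler handling, with more upgrades to come
First
Upgraded First with anti-anti-crawler measures; it crawled 80,000 records in about 20 minutes. The speed was impressive, and the crawler was not blocked
Added a honey folder that stores the crawled data for later data analysis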
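
The README does not say which anti-anti-crawler measures First uses. One common approach, shown here purely as an assumed example, is a Scrapy downloader middleware that picks a random User-Agent for each request. The class name, the User-Agent strings, and the settings path in the comment are all hypothetical.

```python
import random


class RandomUserAgentMiddleware:
    """Downloader middleware sketch: set a random User-Agent on each request."""

    # Example values only; a real list would be maintained separately.
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/604.3.5 "
        "(KHTML, like Gecko) Version/11.0.1 Safari/604.3.5",
        "Mozilla/5.0 (X11; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0",
    ]

    def process_request(self, request, spider=None):
        # Scrapy calls this hook for every outgoing request.
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        return None  # None tells Scrapy to continue processing the request


# To enable it in a Scrapy project (module path is hypothetical), settings.py
# would contain something like:
# DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.RandomUserAgentMiddleware": 400}


class FakeRequest:
    """Tiny stand-in for scrapy.Request so the sketch runs without Scrapy."""

    def __init__(self):
        self.headers = {}


if __name__ == "__main__":
    req = FakeRequest()
    RandomUserAgentMiddleware().process_request(req)
    print(req.headers["User-Agent"])
```

The priority 400 places the middleware ahead of Scrapy's built-in UserAgentMiddleware (500), which only fills in a default when no User-Agent header is already set.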
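xxx
For practicing middlewares and for crawling dynamic content with webdriver and PhantomJS
mongoTest
Tests storing crawled data in MongoDB, to learn how MongoDB is used
company
Collects information on software-industry companies from a yellow-pages site: company name, type, legal representative, and so on. Most features were implemented (2017-11-28)
Stores the data in MongoDB, as MongoDB practice
A later update added pagination, so data from every page can be fetched; the crawler is essentially complete
The data was stored in MongoDB, but for easier viewing the crawler was run again to write a JSON file with more than 3,700 records
2017-12-01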
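
Writing the crawled items to a JSON file for easy viewing, as described for company, can be sketched as follows. The function names and the sample record are illustrative assumptions; `ensure_ascii=False` keeps Chinese company names readable in the output file.

```python
import json


def save_items(items, path):
    """Write a list of item dicts to a UTF-8 JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False writes non-ASCII text as-is instead of \uXXXX escapes.
        json.dump(items, f, ensure_ascii=False, indent=2)


def load_items(path):
    """Read the items back from the JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    # Made-up record with the kinds of fields the company spider collects.
    sample = [{"name": "示例软件公司", "type": "software", "legal_person": "张三"}]
    save_items(sample, "company_sample.json")
    print(load_items("company_sample.json") == sample)  # True
```

Scrapy's feed exports can produce a similar file directly with `scrapy crawl <spider> -o items.json`, where `<spider>` is the spider's name.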
KrSpider
Uses chromeDriver to simulate browser actions and crawl real-time news from 36kr
companyP
An improved crawler that crawls the whole site

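V0.0.3 release contents: newSpider

bilibili
Calls an API endpoint to fetch the danmaku (bullet comments) of bilibili live streams; a simple implementation
novel
A site-wide crawler that fetches the details and content of every novel on Quanshuwang (全书网) and stores them in a database
spider2 is an updated version of spider that can fetch data for the whole site and save it to the database
2017-12-25
jd
Uses splinter to simulate logging in to jd and grab flash-sale items
weibo
Uses splinter to simulate logging in to weibo and automatically repost posts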

</2017><2018>Make everyday better
