webspider's Introduction

Lagou Spider

Version: 1.0.1
Website: http://www.jobinfo.cc:8000/
Source: https://github.com/GuozhuHe/webspider
Keywords: Python3, Tornado, Celery, Spider, Lagou, Requests

About This System

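This system is a crawler that scrapes job-posting data, built mainly with Python 3, Celery, and Requests. It implements scheduled tasks, retry on error, logging, and automatic Cookie replacement, and it uses ECharts + Bootstrap to build the front-end pages that display the scraped data.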
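
The retry-on-error and automatic Cookie replacement behaviour described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the project's actual code: `FetchError`, `fetch_with_retry`, and the `fetch` and `refresh_cookies` callables are hypothetical names, and exponential backoff is an assumed retry policy.

```python
import time
from typing import Callable, Optional


class FetchError(Exception):
    """Raised by a fetch callable when a request fails (hypothetical)."""


def fetch_with_retry(
    fetch: Callable[[dict], str],
    refresh_cookies: Callable[[], dict],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(cookies); on failure, wait, get fresh cookies, and retry."""
    cookies = refresh_cookies()
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return fetch(cookies)
        except FetchError as exc:
            last_error = exc
            if attempt == max_attempts - 1:
                break  # no attempts left, fall through to the final raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * (2 ** attempt))
            # Assumption: a failed request means the old cookies may be stale.
            cookies = refresh_cookies()
    raise FetchError(f"gave up after {max_attempts} attempts") from last_error
```

In a real crawler, `fetch` would wrap the HTTP request and `refresh_cookies` would obtain a new session; passing `sleep` as a parameter keeps the sketch easy to exercise without real delays.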

Demo Page

[Screenshot of the demo page]

Quick Start

All of the following steps were run on Linux (Ubuntu).

  • Clone the project
git clone git@github.com:GuozhuHe/webspider.git
  • Install MySQL, Redis, and Python3
# Install Redis
apt-get install redis-server

# Start redis-server in the background
nohup redis-server &

# Install Python3
apt-get install python3

# Install MySQL
apt-get install mysql-server

# Start MySQL
sudo service mysql start
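  • Set up the database and tables
# Create the database
CREATE DATABASE `spider` CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
# You also need to create the tables. Their definitions are in tests/schema.sql; copy them into the MySQL command line and run them.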
  • Build from the project root directory
make
# After a successful build, the project's env/bin directory contains the executable scripts
  • Run the unit tests
make test
  • Run the code style check
make flake8
  • Run the web server
env/bin/web
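  • Run the crawler
# Start the scheduled-task dispatcher
env/bin/celery_beat
# Start the worker that crawls job data (runs automatically once a month)
env/bin/celery_lagou_data_worker
# Start the worker that crawls job counts (runs automatically once every night)
env/bin/celery_jobs_count_worker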
  • Other executable scripts in the env/bin directory
# Crawl job counts directly
env/bin/crawl_jobs_count
# Crawl job data directly
env/bin/crawl_lagou_data
# Start Celery monitoring
env/bin/celery_flower
  • Clean the build artifacts
make clean
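
The beat process and the two workers in the steps above imply a Celery beat schedule with one monthly entry and one nightly entry. A configuration sketch of such a schedule follows, assuming Redis on localhost as the broker; the app name, task names, and exact run times are hypothetical and are not taken from the project's source.

```python
from celery import Celery
from celery.schedules import crontab

# Assumption: the Redis server installed in the Quick Start acts as the broker.
app = Celery("webspider_sketch", broker="redis://localhost:6379/0")

app.conf.beat_schedule = {
    # Job data: once a month (assumed here: 00:00 on the 1st).
    "crawl-lagou-data-monthly": {
        "task": "tasks.crawl_lagou_data",  # hypothetical task name
        "schedule": crontab(minute=0, hour=0, day_of_month=1),
    },
    # Job counts: once every night (assumed here: 23:00).
    "crawl-jobs-count-nightly": {
        "task": "tasks.crawl_jobs_count",  # hypothetical task name
        "schedule": crontab(minute=0, hour=23),
    },
}
```

With a schedule like this, the beat process only enqueues tasks at the configured times; a worker still has to be running to execute them, which is why the steps above start the beat process and the workers separately.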

TODO

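  • Separate the front end from the back end

  • Refactor how the database is accessed

  • Caching and cache invalidation

  • Fix bug: MySQL Server has gone away. See this MR for details.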

Other Common Questions

Have a question? Contact me and I will help:
