
selenium-php's Introduction

Data scraping with PHP and Selenium: a hands-on, step-by-step tutorial

tips!

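This project uses scraping Zhubajie (猪八戒) task listings as its example and is intended for learning and discussion only. Read the target site's robots.txt before scraping.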

Do not use it for anything illegal; you bear the consequences of any misuse.

Runtime environment and dependencies

Runtime environment: PHP 7.1, Redis 4.0, MySQL 5.6
Dependencies: Java, Chrome, ChromeDriver, Selenium

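The dependencies are already bundled for download on Baidu Netdisk

Link: https://pan.baidu.com/s/1gbSckvixLMbW5JB3eaY6dQ Extraction code: 29qb

If you prefer not to use the netdisk, each dependency can be downloaded from the links below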

Dependency 1: Java JDK 8 download

https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

Dependency 2: Chrome download (my version: 76.0.3809.132)

https://www.chromedownloads.net/

Dependency 3: ChromeDriver download (my version: 72.0.3626.69)

https://chromedriver.storage.googleapis.com/index.html?path=72.0.3626.69/
Other versions:
https://chromedriver.storage.googleapis.com/index.html

Dependency 4: Selenium Server download (version: selenium-server-standalone-3.141.59)

https://www.seleniumhq.org/download/
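Once everything is installed, a quick check that the command-line tools are reachable can save debugging time later. The following is a minimal sketch, not part of the project: the helper name `missing_tools` is hypothetical, and the command names (`redis-cli`, `mysql`, `chromedriver`) are assumptions about how each dependency ends up on your PATH. Chrome itself is not checked because its binary name differs per platform.

```shell
#!/bin/sh
# Sketch only: missing_tools is an illustrative helper, not a project script.

# Print the name of every command in the argument list that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Assumed command names; adjust them to match your installation.
missing="$(missing_tools php java mysql redis-cli chromedriver)"
if [ -n "$missing" ]; then
  echo "Not found on PATH:" $missing
else
  echo "All checked commands are on PATH."
fi
```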

Usage

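  • Install the runtime environment and dependencies, and start them
  • Create the database and import the table definitions from the SQL file
    mysql -u username -ppassword -e "create database selenium_php character set utf8 collate utf8_general_ci"
    mysql -u username -ppassword selenium_php < zhubajie.sql
  • Configure the Redis and MySQL settings in .env
  • cd selenium-php
  • java -jar selenium-server-standalone-3.141.59.jar
  • Scrape the list pages (the page range is currently hard-coded to pages 2 to 5)
    php scripts/zhubajie/spider_list.php >> ./log/spider_list.log 2>&1
  • Once the list pages are done, push the tasks onto a Redis queue (so the detail pages can be scraped by multiple processes)
    php scripts/zhubajie/get_db_id_to_redis.php
  • Scrape the detail pages
    php scripts/zhubajie/spider_detail.php >> ./log/spider_detail.log 2>&1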
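The Redis queue exists so that several detail-page scrapers can run at once. The following is a minimal sketch of how the steps above might be scripted, not part of the project: the helper names `wait_for_selenium` and `run_workers`, the retry count, and the worker count of 4 are assumptions. Port 4444 and the `/wd/hub/status` endpoint are the Selenium 3 standalone server defaults.

```shell
#!/bin/sh
# Sketch only: helper names, retry count, and worker count are illustrative.

# Poll a Selenium status URL until it answers or the attempts run out.
# $1: status URL, $2: maximum attempts (default 30), one second apart.
# The body runs in a subshell so its variables stay local.
wait_for_selenium() (
  url="$1"
  attempts="${2:-30}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl -fsS "$url" >/dev/null 2>&1; then
      exit 0
    fi
    i=$((i + 1))
    sleep 1
  done
  exit 1
)

# Start N copies of a command in the background and wait for all of them.
# $1: number of copies, remaining arguments: the command to run.
# Exits non-zero if any copy failed.
run_workers() (
  n="$1"
  shift
  pids=""
  i=0
  while [ "$i" -lt "$n" ]; do
    "$@" &
    pids="$pids $!"
    i=$((i + 1))
  done
  status=0
  for pid in $pids; do
    wait "$pid" || status=1
  done
  exit "$status"
)

# Possible usage, run from the selenium-php directory:
#   java -jar selenium-server-standalone-3.141.59.jar &
#   wait_for_selenium http://localhost:4444/wd/hub/status || exit 1
#   run_workers 4 php scripts/zhubajie/spider_detail.php >> ./log/spider_detail.log 2>&1
```

All workers append to the same log file here, so their lines will interleave; give each worker its own file if you need to read them separately.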

FAQ

macOS refuses to open a downloaded binary "because the developer cannot be verified"

sudo spctl --master-disable

How to set environment variables on Windows

https://www.java.com/zh_CN/download/help/path.xml
