michaeluno / php-simple-web-scraper

A PHP application which runs on Heroku and dumps web site outputs including JavaScript generated contents.

License: MIT License

PHP 65.28% JavaScript 34.40% HTML 0.32%
heroku scraper php phantomjs crowler proxy cross-site cross-domain cross-domain-request cross-domain-solution

php-simple-web-scraper's Introduction

PHP Simple Web Scraper

A PHP application for Heroku, which can dump web site outputs including JavaScript generated contents.

Demo

Visit here. If the server is sleeping, it takes several seconds to wake up.

Usage

Basic Usage

Send an HTTP request to the app with the url query parameter set to the URL-encoded address of the target page.

http(s)://{app-address}/?url={encoded target url}

Example

http(s)://{app-address}/?url=https%3A%2F%2Fgithub.com
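
As a rough illustration of the encoding step, the sketch below builds the request URL in Python. The app address and the helper name build_request_url are placeholders invented for this example, not part of the application.

```python
from urllib.parse import quote

# Hypothetical app address; replace it with your own deployment.
APP_ADDRESS = "https://example-app.herokuapp.com"


def build_request_url(target_url: str, app_address: str = APP_ADDRESS) -> str:
    """Return the scraper URL with target_url percent-encoded as the url value."""
    # safe="" makes quote() encode ':' and '/' too, matching the example above.
    return f"{app_address}/?url={quote(target_url, safe='')}"


print(build_request_url("https://github.com"))
```

Passing safe="" matters here: with the default settings, quote() leaves slashes unencoded.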

Parameters

output

Determines the output type. Accepted values are html, json, and screenshot.

html (default)

The HTML source code of the target web site. JavaScript-generated content is also retrieved and dumped.

json

output=json

HTTP response data as JSON. Useful for cross domain communications with JSONP.
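
For a server-side client, a request with output=json might be consumed as in the following sketch. The app address, the function names, and the injectable fetch argument are assumptions made for illustration. The structure of the returned JSON is not documented here, so the code only parses it and does not assume any particular keys.

```python
import json
from typing import Any, Callable
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical app address; replace it with your own deployment.
APP_ADDRESS = "https://example-app.herokuapp.com"


def http_get(url: str) -> str:
    """Default fetcher: perform a real HTTP GET and return the body as text."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")


def fetch_json(target_url: str,
               fetch: Callable[[str], str] = http_get,
               app_address: str = APP_ADDRESS) -> Any:
    """Request target_url with output=json and parse the response body."""
    query = urlencode({"url": target_url, "output": "json"})
    return json.loads(fetch(f"{app_address}/?{query}"))
```

The fetch argument exists only so that the function can be exercised without network access; calling fetch_json("https://github.com") with the default fetcher performs a real request.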

Example

http(s)://{app-address}/?url=https%3A%2F%2Fgithub.com&output=json

screenshot

output=screenshot

An image of the site snapshot, in JPEG format by default.

Example

http(s)://{app-address}/?url=https%3A%2F%2Fgithub.com&output=screenshot

file-type

When screenshot is given for the output parameter, the output file type can be set with the file-type parameter. Default: jpg.

It accepts the following values: pdf, png, jpg, jpeg, bmp, ppm.

width

When screenshot is given for the output parameter, width sets the screenshot image width.

height

When screenshot is given for the output parameter, height sets the screenshot image height. Leave it unset to get full height. The default minimum height is 720 pixels.

Example

http(s)://{app-address}/?url=https%3A%2F%2Fgithub.com&output=screenshot&file-type=png
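
The screenshot parameters can be combined as in the sketch below, which checks file-type against the values listed above before building the URL. The app address and the helper name screenshot_url are placeholders for this example.

```python
from typing import Optional
from urllib.parse import urlencode

# Hypothetical app address; replace it with your own deployment.
APP_ADDRESS = "https://example-app.herokuapp.com"

# The file-type values listed in this section.
FILE_TYPES = {"pdf", "png", "jpg", "jpeg", "bmp", "ppm"}


def screenshot_url(target_url: str,
                   file_type: str = "jpg",
                   width: Optional[int] = None,
                   height: Optional[int] = None,
                   app_address: str = APP_ADDRESS) -> str:
    """Build a screenshot request URL; unset width and height are omitted."""
    if file_type not in FILE_TYPES:
        raise ValueError(f"unsupported file-type: {file_type!r}")
    params = {"url": target_url, "output": "screenshot", "file-type": file_type}
    if width is not None:
        params["width"] = width
    if height is not None:
        # Omitting height asks for the full page height.
        params["height"] = height
    return f"{app_address}/?{urlencode(params)}"


print(screenshot_url("https://github.com", file_type="png", width=1280))
```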

user-agent

Sets a custom user agent. By default, the user agent of the client accessing the app is used; pass a value with this parameter to override it.

If random is given, the user-agent will be randomly assigned.

Example

To set the user agent to Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:57.0) Gecko/20100102 Firefox/57.0:

http(s)://{app-address}/?url=https%3A%2F%2Fwww.whatismybrowser.com%2Fdetect%2Fwhat-http-headers-is-my-browser-sending&user-agent=Mozilla/5.0%20(Windows%20NT%206.1;%20Win64;%20x64;%20rv:57.0)%20Gecko/20100102%20Firefox/57.0

To assign a random user agent:

http(s)://{app-address}/?url=https%3A%2F%2Fwww.whatismybrowser.com%2Fdetect%2Fwhat-http-headers-is-my-browser-sending&user-agent=random
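
User agent strings contain spaces, slashes, and parentheses, so it is easiest to let a library encode them. The sketch below is one way to do that; the app address and the helper name with_user_agent are placeholders. It encodes more characters than the hand-written example above, which is harmless because the values decode to the same string.

```python
from urllib.parse import quote, urlencode

# Hypothetical app address; replace it with your own deployment.
APP_ADDRESS = "https://example-app.herokuapp.com"


def with_user_agent(target_url: str,
                    user_agent: str = "random",
                    app_address: str = APP_ADDRESS) -> str:
    """Build a request URL that sets the user-agent parameter."""
    params = {"url": target_url, "user-agent": user_agent}
    # quote_via=quote encodes spaces as %20 instead of '+'.
    query = urlencode(params, quote_via=quote, safe="")
    return f"{app_address}/?{query}"


print(with_user_agent("https://github.com", "Mozilla/5.0 (X11)"))
```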

load-images

Decides whether to load images. By default, this is disabled for the html and json output types and enabled for the screenshot output type.

Accepts a boolean value: true or false (or 1 or 0).

Example

http(s)://{app-address}/?url=https%3A%2F%2Fwww.whatismybrowser.com%2Fdetect%2Fwhat-http-headers-is-my-browser-sending&load-images=true

output-encoding

Sets the encoding used for the output. Default: utf8

cache-lifespan

All requests are cached for 20 minutes by default. This parameter determines how long, in seconds, the cache is retained. If you do not want a cached result, or want to renew the cache, pass 0. Default: 1200.
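
The default of 1200 corresponds to 20 minutes, which suggests the value is given in seconds. Under that assumption, the sketch below converts a duration into the cache-lifespan value. The app address and the helper name cached_request_url are placeholders.

```python
from datetime import timedelta
from urllib.parse import urlencode

# Hypothetical app address; replace it with your own deployment.
APP_ADDRESS = "https://example-app.herokuapp.com"


def cached_request_url(target_url: str,
                       lifespan: timedelta = timedelta(minutes=20),
                       app_address: str = APP_ADDRESS) -> str:
    """Build a request URL with cache-lifespan given in whole seconds."""
    seconds = int(lifespan.total_seconds())
    if seconds < 0:
        raise ValueError("cache lifespan cannot be negative")
    params = {"url": target_url, "cache-lifespan": seconds}
    return f"{app_address}/?{urlencode(params)}"


# A zero lifespan asks for a fresh, uncached result.
print(cached_request_url("https://github.com", timedelta(0)))
```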

headers

Sets custom HTTP headers. Accepts the value as an array.

Example

To set the DNT header:

http(s)://{app-address}/?url=https%3A%2F%2Fwww.whatismybrowser.com%2Fdetect%2Fwhat-http-headers-is-my-browser-sending&headers[DNT]=1
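
PHP reads bracketed query keys such as headers[DNT]=1 as array entries, so a client only needs to flatten its header mapping into that form. The sketch below shows one way to do it. The app address and the helper names are placeholders, and the brackets are left unencoded to match the example above.

```python
from typing import Dict, List
from urllib.parse import quote

# Hypothetical app address; replace it with your own deployment.
APP_ADDRESS = "https://example-app.herokuapp.com"


def bracket_params(name: str, values: Dict[str, object]) -> List[str]:
    """Flatten {"DNT": 1} into ["headers[DNT]=1"]-style query pairs."""
    return [
        f"{name}[{quote(str(key), safe='')}]={quote(str(value), safe='')}"
        for key, value in values.items()
    ]


def url_with_headers(target_url: str,
                     headers: Dict[str, object],
                     app_address: str = APP_ADDRESS) -> str:
    """Build a request URL that passes custom headers as an array."""
    parts = [f"url={quote(target_url, safe='')}"]
    parts += bracket_params("headers", headers)
    return f"{app_address}/?" + "&".join(parts)


print(url_with_headers("https://github.com", {"DNT": 1}))
```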

method

The HTTP request method. Default: GET. Accepts the following values:

  • OPTIONS
  • GET
  • HEAD
  • POST
  • PUT
  • DELETE
  • PATCH

When using POST, pass the data to send with the data request key. The program reads $_REQUEST[ 'data' ] to obtain the POST data.

Example

http(s)://{app-address}/?url=http%3A%2F%2Fhttpbin.org%2Fpost&method=POST&data[foo]=bar
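
A client could check the method against the list above and attach the data entries as in the sketch below. The app address and the helper name request_url are placeholders; only the parameter names url, method, and data come from this document.

```python
from typing import Dict, Optional
from urllib.parse import quote

# Hypothetical app address; replace it with your own deployment.
APP_ADDRESS = "https://example-app.herokuapp.com"

# The methods listed in this section.
METHODS = ("OPTIONS", "GET", "HEAD", "POST", "PUT", "DELETE", "PATCH")


def request_url(target_url: str,
                method: str = "GET",
                data: Optional[Dict[str, object]] = None,
                app_address: str = APP_ADDRESS) -> str:
    """Build a request URL with an explicit method and optional data entries."""
    method = method.upper()
    if method not in METHODS:
        raise ValueError(f"unsupported method: {method!r}")
    parts = [f"url={quote(target_url, safe='')}", f"method={method}"]
    for key, value in (data or {}).items():
        # Each entry becomes data[key]=value, which PHP reads as an array item.
        parts.append(f"data[{quote(str(key), safe='')}]={quote(str(value), safe='')}")
    return f"{app_address}/?" + "&".join(parts)


print(request_url("http://httpbin.org/post", "POST", {"foo": "bar"}))
```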

Run as Heroku Application

This is a Heroku application and is meant to be deployed to a Heroku application instance.

Requirements

Steps to Deploy

a) Quick Deploy

You may simply use the following button to deploy this application:

Deploy

b) Manual Deploy

  1. Clone this repository to your local machine. Create a directory and, from there, type the following in a console window.
git clone https://github.com/michaeluno/php-simple-web-scraper.git

This will download the repository files.

  2. Change the working directory to the cloned one.
cd php-simple-web-scraper
  3. Log in to Heroku from the Heroku CLI.
heroku login
  4. Create a new Heroku app.
heroku create

This gives something like the following, with a random app name. In the example below, glacial-basin-46381 is the app name.

https://glacial-basin-46381.herokuapp.com/ | https://git.heroku.com/glacial-basin-46381.git
  5. Type the following. Replace {heroku-app-name} with the app name given in the previous step.
heroku git:remote -a {heroku-app-name}
  6. Upload the files to Heroku.
git push heroku master
  7. Open the app in your browser.
heroku open

php-simple-web-scraper's Issues

Javascript content show issue

Hi, I deployed the app on Heroku and it works great there: I tried to get the JavaScript content of a web page and got exactly what I needed. I then tried the same thing on Windows Server 2019. Everything installed without any issues, including PhantomJS.
When I try to crawl the same web page there, I do not get the JavaScript content that I got on Heroku.

What could be the reason for this?
PhantomJS bin path:
http://3.134.82.150/crowl/vendor/bin/phantomjs.exe
Here are the links I used.
Heroku link:
https://pcrawl.herokuapp.com/
Or try it on your own page:
https://php-simple-web-scraper.herokuapp.com/

Windows link:
http://3.134.82.150/crowl/web/
Crawl link:
http://3.134.82.150/d_a.php

Declaration of JonnyW\PhantomJs\DependencyInjection\ServiceContainer::load()

Issue

Warning: Declaration of JonnyW\PhantomJs\DependencyInjection\ServiceContainer::load() should be compatible with Symfony\Component\DependencyInjection\Container::load($file) in : php-simple-web-scraper\vendor\jonnyw\php-phantomjs\src\JonnyW\PhantomJs\DependencyInjection\ServiceContainer.php on line 20

Fatal error: Uncaught Error: Call to undefined method JonnyW\PhantomJs\Http\Request::addSetting() in php-simple-web-scraper\web\include\class\Browser\Browser.php on line 17
