
Crawla - a simple web crawler library


Installation

Via composer

$ composer require radowoj/crawla

Example 1 - get titles, commit counts and READMEs from pages linked from an entry point

<?php

use Symfony\Component\DomCrawler\Crawler as DomCrawler;

require_once('../vendor/autoload.php');

$crawler = new \Radowoj\Crawla\Crawler(
    'https://github.com/radowoj'
);

$dataGathered = [];

//configure our crawler
//first - set CSS selector for links that should be visited
$crawler->setLinkSelector('span.pinned-repo-item-content span.d-block a.text-bold')

    //second - customize guzzle client used for requests
    ->setClient(new GuzzleHttp\Client([
        GuzzleHttp\RequestOptions::DELAY => 100
    ]))

    //third - define what should be done when a page is visited
    ->setPageVisitedCallback(function(DomCrawler $domCrawler) use(&$dataGathered) {
        //the callback is called for every visited page, including the base URL,
        //so make sure repo data is gathered only on repo pages
        if (!preg_match('/radowoj\/\w+/', $domCrawler->getUri())) {
            return;
        }

        $readme = $domCrawler->filter('#readme');

        $dataGathered[] = [
            'title' => trim($domCrawler->filter('span[itemprop="about"]')->text()),
            'commits' => trim($domCrawler->filter('li.commits span.num')->text()),
            'readme' => $readme->count() ? trim($readme->text()) : '',
        ];
    });

//now crawl, following links up to 1 level deep from the entry point
$crawler->crawl(1);

var_dump($dataGathered);

var_dump($crawler->getVisited()->all());

Example 2 - simple site map

<?php

require_once('../vendor/autoload.php');

$crawler = new \Radowoj\Crawla\Crawler(
    'https://developer.github.com/'
);

$dataGathered = [];

//configure our crawler
$crawler->setClient(new GuzzleHttp\Client([
        GuzzleHttp\RequestOptions::DELAY => 100
    ]))
    
    //set link selector (all links - this is the default value)
    ->setLinkSelector('a');

//crawl up to 1 level deep
$crawler->crawl(1);

//get URLs of all visited pages
var_dump($crawler->getVisited()->all());

//get links that were too deep to visit
var_dump($crawler->getTooDeep()->all());
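Building on Example 2, the list of visited URLs can be written out as a plain-text sitemap. This is only a sketch using the calls already shown above; the output filename `sitemap.txt` is illustrative, and the crawl requires network access:

```php
<?php

require_once('../vendor/autoload.php');

$crawler = new \Radowoj\Crawla\Crawler(
    'https://developer.github.com/'
);

//same configuration as Example 2: throttled Guzzle client, follow all links
$crawler->setClient(new GuzzleHttp\Client([
        GuzzleHttp\RequestOptions::DELAY => 100
    ]))
    ->setLinkSelector('a');

$crawler->crawl(1);

//one URL per line - the plain-text sitemap format accepted by most search engines
file_put_contents('sitemap.txt', implode(PHP_EOL, $crawler->getVisited()->all()));
```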

