
Crawla - a simple web crawler library


Installation

Via composer

$ composer require radowoj/crawla

Example 1 - get titles, commit counts and READMEs from pages linked from an entry point

<?php

use Symfony\Component\DomCrawler\Crawler as DomCrawler;

require_once('../vendor/autoload.php');

$crawler = new \Radowoj\Crawla\Crawler(
    'https://github.com/radowoj'
);

$dataGathered = [];

//configure our crawler
//first - set CSS selector for links that should be visited
$crawler->setLinkSelector('span.pinned-repo-item-content span.d-block a.text-bold')

    //second - customize guzzle client used for requests
    ->setClient(new GuzzleHttp\Client([
        GuzzleHttp\RequestOptions::DELAY => 100
    ]))

    //third - define what should be done when a page is visited
    ->setPageVisitedCallback(function(DomCrawler $domCrawler) use(&$dataGathered) {
        //the callback is called for every visited page, including the base URL,
        //so make sure repo data is gathered only on repo pages
        if (!preg_match('/radowoj\/\w+/', $domCrawler->getUri())) {
            return;
        }

        $readme = $domCrawler->filter('#readme');

        $dataGathered[] = [
            'title' => trim($domCrawler->filter('span[itemprop="about"]')->text()),
            'commits' => trim($domCrawler->filter('li.commits span.num')->text()),
            'readme' => $readme->count() ? trim($readme->text()) : '',
        ];
    });

//now crawl, following links up to 1 level deep from the entry point
$crawler->crawl(1);

var_dump($dataGathered);

var_dump($crawler->getVisited()->all());

Example 2 - simple site map

<?php

require_once('../vendor/autoload.php');

$crawler = new \Radowoj\Crawla\Crawler(
    'https://developer.github.com/'
);

$dataGathered = [];

//configure our crawler
$crawler->setClient(new GuzzleHttp\Client([
        GuzzleHttp\RequestOptions::DELAY => 100
    ]))
    
    //set link selector (all links - this is the default value)
    ->setLinkSelector('a');

//crawl up to 1 level deep
$crawler->crawl(1);

//get URLs of all visited pages
var_dump($crawler->getVisited()->all());

//get links that were too deep to visit
var_dump($crawler->getTooDeep()->all());
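Building on Example 2, the list of visited URLs can be written out as a plain-text sitemap. This is only a sketch using the calls already shown above; the output filename `sitemap.txt` is illustrative, and the crawl requires network access:

```php
<?php

require_once('../vendor/autoload.php');

$crawler = new \Radowoj\Crawla\Crawler(
    'https://developer.github.com/'
);

//same configuration as Example 2: throttled Guzzle client, follow all links
$crawler->setClient(new GuzzleHttp\Client([
        GuzzleHttp\RequestOptions::DELAY => 100
    ]))
    ->setLinkSelector('a');

$crawler->crawl(1);

//one URL per line - the plain-text sitemap format accepted by most search engines
file_put_contents('sitemap.txt', implode(PHP_EOL, $crawler->getVisited()->all()));
```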

