chuyskywalker / rolling-curl

This project is forked from lionsad/rolling-curl.


Rolling-Curl: A non-blocking, non-dos multi-curl library for PHP

Home Page: https://github.com/chuyskywalker/rolling-curl


rolling-curl's Introduction

RollingCurl

A cURL library to fetch a large number of resources while maintaining a consistent number of simultaneous connections

Authors:

  • Jeff Minard (jrm.cc)
  • Josh Fraser (joshfraser.com)
  • Alexander Makarov (rmcreative.ru)

Overview

RollingCurl is a more efficient way to use PHP's curl_multi functions.

curl_multi is a great way to process multiple HTTP requests in parallel in PHP, but it suffers from a few faults:

  1. The documentation for curl_multi is very obtuse and, as such, it is easy to implement incorrectly or poorly
  2. Most curl_multi examples queue up all requests and execute them all at once

The second point is the most important one for two reasons:

  1. If you have to wait on every single request to complete, your program is "blocked" by the longest running request.
  2. More importantly, when you run a large number of cURL requests simultaneously you are, essentially, running a DoS attack. If you have to fetch hundreds or even thousands of URLs, you're very likely to be blocked by automated DoS-protection systems. At best, you're not being a very respectful citizen of the internet.

RollingCurl deals with both issues by maintaining a maximum number of simultaneous requests and "rolling" new requests into the queue as existing requests complete. When requests complete, and while other requests are still running, RollingCurl can run an anonymous function to process the fetched result. (You have the option to skip the function and instead process all requests once they are done, should you prefer.)
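For illustration, the rolling pattern can be sketched directly on top of PHP's raw curl_multi functions. This is a simplified sketch, not RollingCurl's actual source; the `rollingFetch` name is made up here:

```php
<?php
// Hypothetical sketch of the "rolling" pattern on raw curl_multi:
// keep at most $limit transfers in flight and start a replacement
// each time one finishes. RollingCurl wraps this kind of loop.
function rollingFetch(array $urls, int $limit, callable $onDone): void
{
    $multi = curl_multi_init();

    // Start the next queued URL, if any
    $startNext = function () use (&$urls, $multi) {
        if (($url = array_shift($urls)) !== null) {
            $ch = curl_init($url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_multi_add_handle($multi, $ch);
        }
    };

    // Prime the pool up to the simultaneous limit
    for ($i = 0; $i < $limit; $i++) {
        $startNext();
    }

    do {
        curl_multi_exec($multi, $running);
        if (curl_multi_select($multi, 1.0) === -1) {
            usleep(1000); // select can fail on some platforms; avoid busy-looping
        }

        // For every completed transfer, hand off the result and roll in a new URL
        while ($info = curl_multi_info_read($multi)) {
            $ch = $info['handle'];
            $onDone(curl_getinfo($ch, CURLINFO_EFFECTIVE_URL), curl_multi_getcontent($ch));
            curl_multi_remove_handle($multi, $ch);
            curl_close($ch);
            $startNext();
        }
    } while ($running > 0 || $urls);

    curl_multi_close($multi);
}
```

In RollingCurl's API, setSimultaneousLimit() and setCallback() play the roles of $limit and $onDone above.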

Installation (via composer)

Get Composer and add this to the require section of your composer.json:

{
    "require": {
        "chuyskywalker/rolling-curl": "*"
    }
}

and then

composer install

Usage

Basic Example

$rollingCurl = new \RollingCurl\RollingCurl();
$rollingCurl
    ->get('http://yahoo.com')
    ->get('http://google.com')
    ->get('http://hotmail.com')
    ->get('http://msn.com')
    ->get('http://reddit.com')
    ->setCallback(function(\RollingCurl\Request $request, \RollingCurl\RollingCurl $rollingCurl) {
        // parsing html with regex is evil (http://bit.ly/3x9sQX), but this is just a demo
        if (preg_match("#<title>(.*)</title>#i", $request->getResponseText(), $out)) {
            $title = $out[1];
        }
        else {
            $title = '[No Title Tag Found]';
        }
        echo "Fetch complete for (" . $request->getUrl() . ") $title " . PHP_EOL;
    })
    ->setSimultaneousLimit(3)
    ->execute();

Fetch A Very Large Number Of Pages

Let's scrape Google for the first 500 result links and titles for "curl":

$rollingCurl = new \RollingCurl\RollingCurl();
for ($i = 0; $i <= 500; $i+=10) {
    // https://www.google.com/search?q=curl&start=10
    $rollingCurl->get('https://www.google.com/search?q=curl&start=' . $i);
}

$results = array();

$start = microtime(true);
echo "Fetching..." . PHP_EOL;
$rollingCurl
    ->setCallback(function(\RollingCurl\Request $request, \RollingCurl\RollingCurl $rollingCurl) use (&$results) {
        if (preg_match_all('#<h3 class="r"><a href="([^"]+)">(.*)</a></h3>#iU', $request->getResponseText(), $out)) {
            foreach ($out[1] as $idx => $url) {
                parse_str(parse_url($url, PHP_URL_QUERY), $params);
                $results[$params['q']] = strip_tags($out[2][$idx]);
            }
        }

        // Clear list of completed requests and prune pending request queue to avoid memory growth
        $rollingCurl->clearCompleted();
        $rollingCurl->prunePendingRequestQueue();

        echo "Fetch complete for (" . $request->getUrl() . ")" . PHP_EOL;
    })
    ->setSimultaneousLimit(10)
    ->execute();
echo "...done in " . (microtime(true) - $start) . PHP_EOL;

echo "All results: " . PHP_EOL;
print_r($results);

Setting custom curl options

For every request

$rollingCurl = new \RollingCurl\RollingCurl();
$rollingCurl
    // setOptions will overwrite all the default options.
    // addOptions is probably a better choice
    ->setOptions(array(
        CURLOPT_HEADER => true,
        CURLOPT_NOBODY => true
    ))
    ->get('http://yahoo.com')
    ->get('http://google.com')
    ->get('http://hotmail.com')
    ->get('http://msn.com')
    ->get('http://reddit.com')
    ->setCallback(function(\RollingCurl\Request $request, \RollingCurl\RollingCurl $rollingCurl) {
        echo "Fetch complete for (" . $request->getUrl() . ")" . PHP_EOL;
    })
    ->setSimultaneousLimit(3)
    ->execute();

For a single request:

$rollingCurl = new \RollingCurl\RollingCurl();

$sites = array(
    'http://yahoo.com' => array(
        CURLOPT_TIMEOUT => 15
    ),
    'http://google.com' => array(
        CURLOPT_TIMEOUT => 5
    ),
    'http://hotmail.com' => array(
        CURLOPT_TIMEOUT => 10
    ),
    'http://msn.com' => array(
        CURLOPT_TIMEOUT => 10
    ),
    'http://reddit.com' => array(
        CURLOPT_TIMEOUT => 25
    ),
);

foreach ($sites as $url => $options) {
    $request = new \RollingCurl\Request($url);
    $rollingCurl->add(
        $request->addOptions($options)
    );
}

$rollingCurl->execute();

More examples can be found in the examples/ directory.

TODO:

  • PHPUnit test
  • Ensure PSR spec compatibility
  • Fix TODOs
  • Better validation on setters

Feel free to fork and pull request to help out with the above. :D

Similar Projects

rolling-curl's People

Contributors

adirelle, bizonix, chuyskywalker, jacobbennett, johnmadrak, lionsad, mseymour, polyfractal, xiankai


rolling-curl's Issues

(Emailed in) Prune & Clear

My apologies for emailing you about your repository; however, it appears the Issues button is turned off, and I wanted to ask for your input on a PR I'd like to make before I start, since you seem to be the head of the project.

Currently, RollingCurl::clearCompleted() states it will help prevent out of memory errors, but it only clears the completedRequests array. In my opinion, it would be extremely beneficial to get the behavior of prunePendingRequestQueue() during that process as well.

I realize that a developer could run both themselves, but if that is the desired approach I think the documentation and phpdocs should be updated to reflect that necessity. When you're running RollingCurl in a continuous script and handling thousands of requests, the pendingRequests array can cause issues, and without reading the code directly you wouldn't know the prune method exists.

I was thinking that perhaps requests could be removed from the pendingRequests array as they are being processed or the prunePendingRequestQueue() could also be called inside clearCompleted().

I just wanted to know if you’d prefer the documentation update approach or the programmatic approach. Please let me know and I’ll gladly create a PR.

Thanks again for your great library and again, I apologize for contacting you via email about your repository.

Bug in pruning pending request queue

Hi,

I just found that prunePendingRequestQueue does not work as it should: it always returns an empty list because of a bug in getNextPendingRequests. Please see this code:

    private function getNextPendingRequests($limit = 1)
    {
        $requests = array();
        while ($limit--) {
            if (!isset($this->pendingRequests[$this->pendingRequestsPosition])) {
                break;
            }
            $requests[] = $this->pendingRequests[$this->pendingRequestsPosition];
            $this->pendingRequestsPosition++;
        }
        return $requests;
    }

If $limit is 0, the while loop will never run. I propose this patch to fix the bug:

-        while ($limit--) {
+        $countPending = $limit <= 0 ? $this->countPending() : $limit;
+        while ($countPending--) {
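Applied, the patched method would read as below. The surrounding `PendingQueue` class and the body of `countPending()` are hypothetical stand-ins so the fix can be shown self-contained; RollingCurl's real class differs:

```php
<?php
// Minimal harness (hypothetical class, not RollingCurl's real one) showing
// the patched method from the issue above. With the original while ($limit--)
// loop, a $limit of 0 returns an empty array; with the patch, a non-positive
// $limit drains everything that is still pending.
class PendingQueue
{
    public $pendingRequests = array();
    public $pendingRequestsPosition = 0;

    public function countPending()
    {
        return count($this->pendingRequests) - $this->pendingRequestsPosition;
    }

    public function getNextPendingRequests($limit = 1)
    {
        $requests = array();
        // Non-positive $limit now means "take all remaining pending requests"
        $countPending = $limit <= 0 ? $this->countPending() : $limit;
        while ($countPending--) {
            if (!isset($this->pendingRequests[$this->pendingRequestsPosition])) {
                break;
            }
            $requests[] = $this->pendingRequests[$this->pendingRequestsPosition];
            $this->pendingRequestsPosition++;
        }
        return $requests;
    }
}
```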

Is there a way to add sleep for each request?

Currently I can slow the requests down by reducing the simultaneous limit, but I am wondering if there is a way to add a random sleep before each request, to make the traffic look more natural?

Thanks!

IP rotation

Do I need to rotate IPs while using rolling-curl to avoid being blocked when fetching the HTML source of a site?

Suggestion: facilitate recursion

Hello,

I have used this script since the 1.0 version for recursive fetching (especially using the "callback" function to add new URLs on the fly), but this version does not seem to support that (I may be wrong and simply not know how to use it this way), so here is my suggestion (what I have done).

In "rolling-curl/src/RollingCurl/RollingCurl.php", just move the "callback" invocation (lines 275 to 279, "remove the curl handle that just completed") before line 262 ("// start a new request (it's important to do this before removing the old one").

With this change, you can add new URLs on the fly in the "callback" function (you have to pass the RollingCurl object into it).

Maybe this can be changed in the source?

Thank you for this very useful script.

CURLOPT_PROXY for single requests fails if threads are > ~10

Hello, I am trying to set up a proxy checker with rolling-curl. When I set CURLOPT_PROXY for every request, all the requests fail if the number of threads is > ~10, whereas if I set the IPs of the proxies as CURLOPT_URL I can run 500 threads fine and the results are consistent. Is this a limitation of cURL itself? I see this behavior with any wrapper around curl_multi, so maybe cURL can't handle many different CURLOPT_PROXY settings simultaneously?

Post Request Problem

Hi Team,

It seems I have encountered a problem using the POST request method.

I have POST data like this:

array(3) {
  ["transfer_type"]=>
  string(7) "one-way"
  ["transfer_details"]=>
  array(1) {
    ["first_transfer"]=>
    array(5) {
      ["pickup_type"]=>
      string(1) "1"
      ["pick_up_city_code"]=>
      string(3) "LON"
      ["pickup"]=>
      string(7) "airport"
      ["dropoff"]=>
      string(5) "hotel"
      ["transfer_date"]=>
      string(10) "2018-10-08"
    }
  }
  ["pax"]=>
  array(2) {
    ["adult"]=>
    int(1)
    ["child_age"]=>
    array(1) {
      [0]=>
      int(5)
    }
  }
}

When I send this POST data to my other API, the request that arrives there looks like this:

array(3) {
  ["transfer_type"]=>
  string(7) "one-way"
  ["transfer_details"]=>
  string(5) "Array"
  ["pax"]=>
  string(5) "Array"
}

What do you think is the cause of this problem?

Thank you!
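For context, a likely cause (this is standard PHP cURL behavior, not specific to RollingCurl): when CURLOPT_POSTFIELDS is given a PHP array, cURL serializes only the top level, so nested arrays are cast to the literal string "Array", which matches the dump above. Pre-encoding the payload as a string avoids this; a sketch using a trimmed version of the payload:

```php
<?php
// Trimmed version of the payload from the issue above
$postData = array(
    'transfer_type'    => 'one-way',
    'transfer_details' => array(
        'first_transfer' => array('pickup' => 'airport', 'dropoff' => 'hotel'),
    ),
    'pax' => array('adult' => 1, 'child_age' => array(5)),
);

// http_build_query produces bracketed keys, e.g.
// transfer_details[first_transfer][pickup]=airport, which the receiving
// side can reparse into the full nested structure.
$encoded = http_build_query($postData);

// Passing $encoded (a string) as the POST body keeps the nesting intact;
// passing $postData (an array) directly is what flattens it to "Array".
```

The encoded string can then be set as the request body (for example via CURLOPT_POSTFIELDS), and parse_str() on the receiving side reconstructs the nested arrays.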

Duplicate lines in RollingCurl.php

Lines 251-252 are duplicates:

$request->setResponseErrno(curl_errno($transfer['handle']));
$request->setResponseError(curl_error($transfer['handle']));

Mixing IPv6 and IPv4 requests

First off, I love rolling-curl. Great work! I noticed that if you mix IPv4 and IPv6 requests, all the requests get set to one or the other. This is not desirable in my situation, as I'm using a mixture of IPv4 and IPv6 interfaces for scraping. Let me know if I can help in any way.

The ideal output below would be both your IPv6 IP and your IPv4 IP.

$sites = [
    'http://icanhazip.com?1' => [CURLOPT_IPRESOLVE => CURL_IPRESOLVE_V4],
    'http://icanhazip.com?2' => [CURLOPT_IPRESOLVE => CURL_IPRESOLVE_V6],
];
foreach ($sites as $url => $options) {
    $request = new \RollingCurl\Request($url);
    $rc->add($request->addOptions($options));
}
$rc->setCallback(function(\RollingCurl\Request $request, \RollingCurl\RollingCurl $rc) {
    $diff = microtime(true) - $_SERVER["REQUEST_TIME_FLOAT"];
    echo "..{$diff}\t" . $request->getUrl() . ": " . $request->getResponseText() . "\n";
});
$rc->execute();
$diff = microtime(true) - $_SERVER["REQUEST_TIME_FLOAT"];
echo "..{$diff}\tdone\n";
