chuyskywalker / rolling-curl

This project is forked from lionsad/rolling-curl.


Rolling-Curl: A non-blocking, non-dos multi-curl library for PHP

Home Page: https://github.com/chuyskywalker/rolling-curl


rolling-curl's Introduction

RollingCurl

A cURL library to fetch a large number of resources while maintaining a consistent number of simultaneous connections

Authors:

  • Jeff Minard (jrm.cc)
  • Josh Fraser (joshfraser.com)
  • Alexander Makarov (rmcreative.ru)

Overview

RollingCurl is a more efficient way to use PHP's curl_multi functions.

curl_multi is a great way to process multiple HTTP requests in parallel in PHP, but it suffers from a few faults:

  1. The documentation for curl_multi is very obtuse and, as such, it is easy to implement incorrectly or poorly
  2. Most curl_multi examples queue up all requests and execute them all at once

The second point is the most important one for two reasons:

  1. If you have to wait on every single request to complete, your program is "blocked" by the longest running request.
  2. More importantly, when you run a large number of cURL requests simultaneously you are, essentially, running a DoS attack. If you have to fetch hundreds or even thousands of URLs, you're very likely to be blocked by automated DoS-protection systems. At best, you're not being a very respectful citizen of the internet.

RollingCurl deals with both issues by maintaining a maximum number of simultaneous requests and "rolling" new requests into the queue as existing requests complete. When requests complete, and while other requests are still running, RollingCurl can run an anonymous function to process the fetched result. (You have the option to skip the function and instead process all requests once they are done, should you prefer.)
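For illustration, the rolling pattern can be sketched directly on top of PHP's raw curl_multi functions. This is a simplified sketch, not RollingCurl's actual source; the `rollingFetch` name is made up here:

```php
<?php
// Hypothetical sketch of the "rolling" pattern on raw curl_multi:
// keep at most $limit transfers in flight and start a replacement
// each time one finishes. RollingCurl wraps this kind of loop.
function rollingFetch(array $urls, int $limit, callable $onDone): void
{
    $multi = curl_multi_init();

    // Start the next queued URL, if any
    $startNext = function () use (&$urls, $multi) {
        if (($url = array_shift($urls)) !== null) {
            $ch = curl_init($url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_multi_add_handle($multi, $ch);
        }
    };

    // Prime the pool up to the simultaneous limit
    for ($i = 0; $i < $limit; $i++) {
        $startNext();
    }

    do {
        curl_multi_exec($multi, $running);
        if (curl_multi_select($multi, 1.0) === -1) {
            usleep(1000); // select can fail on some platforms; avoid busy-looping
        }

        // For every completed transfer, hand off the result and roll in a new URL
        while ($info = curl_multi_info_read($multi)) {
            $ch = $info['handle'];
            $onDone(curl_getinfo($ch, CURLINFO_EFFECTIVE_URL), curl_multi_getcontent($ch));
            curl_multi_remove_handle($multi, $ch);
            curl_close($ch);
            $startNext();
        }
    } while ($running > 0 || $urls);

    curl_multi_close($multi);
}
```

In RollingCurl's API, setSimultaneousLimit() and setCallback() play the roles of $limit and $onDone above.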

Installation (via composer)

Get Composer and add this to the require section of your composer.json:

{
    "require": {
        "chuyskywalker/rolling-curl": "*"
    }
}

and then

composer install

Usage

Basic Example

$rollingCurl = new \RollingCurl\RollingCurl();
$rollingCurl
    ->get('http://yahoo.com')
    ->get('http://google.com')
    ->get('http://hotmail.com')
    ->get('http://msn.com')
    ->get('http://reddit.com')
    ->setCallback(function(\RollingCurl\Request $request, \RollingCurl\RollingCurl $rollingCurl) {
        // parsing html with regex is evil (http://bit.ly/3x9sQX), but this is just a demo
        if (preg_match("#<title>(.*)</title>#i", $request->getResponseText(), $out)) {
            $title = $out[1];
        }
        else {
            $title = '[No Title Tag Found]';
        }
        echo "Fetch complete for (" . $request->getUrl() . ") $title " . PHP_EOL;
    })
    ->setSimultaneousLimit(3)
    ->execute();

Fetch A Very Large Number Of Pages

Let's scrape Google for the first 500 result links and titles for "curl":

$rollingCurl = new \RollingCurl\RollingCurl();
for ($i = 0; $i <= 500; $i+=10) {
    // https://www.google.com/search?q=curl&start=10
    $rollingCurl->get('https://www.google.com/search?q=curl&start=' . $i);
}

$results = array();

$start = microtime(true);
echo "Fetching..." . PHP_EOL;
$rollingCurl
    ->setCallback(function(\RollingCurl\Request $request, \RollingCurl\RollingCurl $rollingCurl) use (&$results) {
        if (preg_match_all('#<h3 class="r"><a href="([^"]+)">(.*)</a></h3>#iU', $request->getResponseText(), $out)) {
            foreach ($out[1] as $idx => $url) {
                parse_str(parse_url($url, PHP_URL_QUERY), $params);
                $results[$params['q']] = strip_tags($out[2][$idx]);
            }
        }

        // Clear list of completed requests and prune pending request queue to avoid memory growth
        $rollingCurl->clearCompleted();
        $rollingCurl->prunePendingRequestQueue();

        echo "Fetch complete for (" . $request->getUrl() . ")" . PHP_EOL;
    })
    ->setSimultaneousLimit(10)
    ->execute();
echo "...done in " . (microtime(true) - $start) . PHP_EOL;

echo "All results: " . PHP_EOL;
print_r($results);

Setting custom curl options

For every request

$rollingCurl = new \RollingCurl\RollingCurl();
$rollingCurl
    // setOptions will overwrite all the default options.
    // addOptions is probably a better choice
    ->setOptions(array(
        CURLOPT_HEADER => true,
        CURLOPT_NOBODY => true
    ))
    ->get('http://yahoo.com')
    ->get('http://google.com')
    ->get('http://hotmail.com')
    ->get('http://msn.com')
    ->get('http://reddit.com')
    ->setCallback(function(\RollingCurl\Request $request, \RollingCurl\RollingCurl $rollingCurl) {
        echo "Fetch complete for (" . $request->getUrl() . ")" . PHP_EOL;
    })
    ->setSimultaneousLimit(3)
    ->execute();

For a single request:

$rollingCurl = new \RollingCurl\RollingCurl();

$sites = array(
    'http://yahoo.com' => array(
        CURLOPT_TIMEOUT => 15
    ),
    'http://google.com' => array(
        CURLOPT_TIMEOUT => 5
    ),
    'http://hotmail.com' => array(
        CURLOPT_TIMEOUT => 10
    ),
    'http://msn.com' => array(
        CURLOPT_TIMEOUT => 10
    ),
    'http://reddit.com' => array(
        CURLOPT_TIMEOUT => 25
    ),
);

foreach ($sites as $url => $options) {
    $request = new \RollingCurl\Request($url);
    $rollingCurl->add(
        $request->addOptions($options)
    );
}

$rollingCurl->execute();

More examples can be found in the examples/ directory.

TODO:

  • PHPUnit test
  • Ensure PSR spec compatibility
  • Fix TODOs
  • Better validation on setters

Feel free to fork and pull request to help out with the above. :D

Similar Projects

rolling-curl's People

Contributors

adirelle, bizonix, chuyskywalker, jacobbennett, johnmadrak, lionsad, mseymour, polyfractal, xiankai


rolling-curl's Issues

(Emailed in) Prune & Clear

My apologies for emailing you about your repository; however, it appears the Issues button is turned off, and I wanted to ask for your input on a PR I'd like to make before I start, since you seem to be the head of the project.

Currently, RollingCurl::clearCompleted() states it will help prevent out of memory errors, but it only clears the completedRequests array. In my opinion, it would be extremely beneficial to get the behavior of prunePendingRequestQueue() during that process as well.

I realize that a developer could run both themselves, but if that is the desired approach I think the documentation and phpdocs should be updated to reflect that necessity. When you're running RollingCurl in a continuous script and handling thousands of requests, the pendingRequests array can cause issues, and without reading the code directly you wouldn't know the prune method exists.

I was thinking that perhaps requests could be removed from the pendingRequests array as they are being processed or the prunePendingRequestQueue() could also be called inside clearCompleted().

I just wanted to know if you’d prefer the documentation update approach or the programmatic approach. Please let me know and I’ll gladly create a PR.

Thanks again for your great library and again, I apologize for contacting you via email about your repository.

Bug in pruning pending request queue

Hi,

I just found that prunePendingRequestQueue does not work as it should: it always returns an empty list because of a bug in getNextPendingRequests. Please see this code:

    private function getNextPendingRequests($limit = 1)
    {
        $requests = array();
        while ($limit--) {
            if (!isset($this->pendingRequests[$this->pendingRequestsPosition])) {
                break;
            }
            $requests[] = $this->pendingRequests[$this->pendingRequestsPosition];
            $this->pendingRequestsPosition++;
        }
        return $requests;
    }

If $limit is 0, the while loop will never run. I propose this patch to fix the bug:

-        while ($limit--) {
+        $countPending = $limit <= 0 ? $this->countPending() : $limit;
+        while ($countPending--) {
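Applied, the patched method would read as below. The surrounding `PendingQueue` class and the body of `countPending()` are hypothetical stand-ins so the fix can be shown self-contained; RollingCurl's real class differs:

```php
<?php
// Minimal harness (hypothetical class, not RollingCurl's real one) showing
// the patched method from the issue above. With the original while ($limit--)
// loop, a $limit of 0 returns an empty array; with the patch, a non-positive
// $limit drains everything that is still pending.
class PendingQueue
{
    public $pendingRequests = array();
    public $pendingRequestsPosition = 0;

    public function countPending()
    {
        return count($this->pendingRequests) - $this->pendingRequestsPosition;
    }

    public function getNextPendingRequests($limit = 1)
    {
        $requests = array();
        // Non-positive $limit now means "take all remaining pending requests"
        $countPending = $limit <= 0 ? $this->countPending() : $limit;
        while ($countPending--) {
            if (!isset($this->pendingRequests[$this->pendingRequestsPosition])) {
                break;
            }
            $requests[] = $this->pendingRequests[$this->pendingRequestsPosition];
            $this->pendingRequestsPosition++;
        }
        return $requests;
    }
}
```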

Is there a way to add sleep for each request?

Currently I can slow the requests down by reducing the simultaneous limit, but I am wondering if there is a way to add a random sleep before each request, to make the traffic look more natural?

Thanks!

IP rotation

Do I need to rotate IPs while using rolling-curl to avoid being blocked when fetching the HTML source of a site?

Suggestion: facilitate recursion

Hello,

I have used this script since the 1.0 version for recursive fetching (especially using the "callback" function to add new URLs on the fly), but this version does not seem to support that (I may be wrong and simply not know how to use it this way), so here is my suggestion (what I have done).

In "rolling-curl/src/RollingCurl/RollingCurl.php", just move the "callback" invocation (lines 275 to 279, "remove the curl handle that just completed") before line 262 ("// start a new request (it's important to do this before removing the old one").

With this change, you can add new URLs on the fly in the "callback" function (you have to pass the RollingCurl object into it).

Maybe this can be changed in the source?

Thank you for this very useful script.

CURLOPT_PROXY for single requests fails if threads are > ~10

Hello, I am trying to set up a proxy checker with rolling-curl. When I set CURLOPT_PROXY for every request, all the requests fail if the number of threads is > ~10, whereas if I set the IPs of the proxies as CURLOPT_URL I can run 500 threads fine and the results are consistent. Is this a limitation of cURL itself? I see this behavior with any wrapper around curl_multi, so maybe cURL can't handle many different CURLOPT_PROXY settings simultaneously?

Post Request Problem

Hi Team,

It seems I have encountered a problem using the POST request method.

I have POST data like this:

array(3) {
  ["transfer_type"]=>
  string(7) "one-way"
  ["transfer_details"]=>
  array(1) {
    ["first_transfer"]=>
    array(5) {
      ["pickup_type"]=>
      string(1) "1"
      ["pick_up_city_code"]=>
      string(3) "LON"
      ["pickup"]=>
      string(7) "airport"
      ["dropoff"]=>
      string(5) "hotel"
      ["transfer_date"]=>
      string(10) "2018-10-08"
    }
  }
  ["pax"]=>
  array(2) {
    ["adult"]=>
    int(1)
    ["child_age"]=>
    array(1) {
      [0]=>
      int(5)
    }
  }
}

When I send this POST data to my other API, the request that arrives there looks like this:

array(3) {
  ["transfer_type"]=>
  string(7) "one-way"
  ["transfer_details"]=>
  string(5) "Array"
  ["pax"]=>
  string(5) "Array"
}

What do you think is the cause of this problem?

Thank you!
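For context, a likely cause (this is standard PHP cURL behavior, not specific to RollingCurl): when CURLOPT_POSTFIELDS is given a PHP array, cURL serializes only the top level, so nested arrays are cast to the literal string "Array", which matches the dump above. Pre-encoding the payload as a string avoids this; a sketch using a trimmed version of the payload:

```php
<?php
// Trimmed version of the payload from the issue above
$postData = array(
    'transfer_type'    => 'one-way',
    'transfer_details' => array(
        'first_transfer' => array('pickup' => 'airport', 'dropoff' => 'hotel'),
    ),
    'pax' => array('adult' => 1, 'child_age' => array(5)),
);

// http_build_query produces bracketed keys, e.g.
// transfer_details[first_transfer][pickup]=airport, which the receiving
// side can reparse into the full nested structure.
$encoded = http_build_query($postData);

// Passing $encoded (a string) as the POST body keeps the nesting intact;
// passing $postData (an array) directly is what flattens it to "Array".
```

The encoded string can then be set as the request body (for example via CURLOPT_POSTFIELDS), and parse_str() on the receiving side reconstructs the nested arrays.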

Duplicate lines in RollingCurl.php

Lines 251-252 are duplicates:

$request->setResponseErrno(curl_errno($transfer['handle']));
$request->setResponseError(curl_error($transfer['handle']));

Mixing IPv6 and IPv4 requests

First off, I love rolling-curl. Great work! I noticed that if you mix IPv4 and IPv6 requests, all the requests get set to one or the other. This is not desirable in my situation, as I'm using a mixture of IPv4 and IPv6 interfaces for scraping. Let me know if I can help in any way.

The ideal output below would be both your IPv6 IP and your IPv4 IP.

$sites = [
    'http://icanhazip.com?1' => [CURLOPT_IPRESOLVE => CURL_IPRESOLVE_V4],
    'http://icanhazip.com?2' => [CURLOPT_IPRESOLVE => CURL_IPRESOLVE_V6],
];
foreach ($sites as $url => $options) {
    $request = new \RollingCurl\Request($url);
    $rc->add($request->addOptions($options));
}
$rc->setCallback(function(\RollingCurl\Request $request, \RollingCurl\RollingCurl $rc) {
    $diff = microtime(true) - $_SERVER["REQUEST_TIME_FLOAT"];
    echo "..{$diff}\t" . $request->getUrl() . ": " . $request->getResponseText() . "\n";
});
$rc->execute();
$diff = microtime(true) - $_SERVER["REQUEST_TIME_FLOAT"];
echo "..{$diff}\tdone\n";
