
fabienvauchelles / scrapoxy



Home Page: http://scrapoxy.io

License: MIT License

JavaScript 0.36% HTML 8.89% Shell 0.10% Dockerfile 0.03% TypeScript 88.34% Java 0.07% Python 0.89% SCSS 0.67% Vue 0.65%
Topics: blacklisting, antibot, proxies, webscraping

scrapoxy's Introduction

Scrapoxy

What is Scrapoxy?

Scrapoxy is a super proxy aggregator, allowing you to manage all proxies in one place 🎯, rather than spreading them across multiple scrapers 🕸️.

It also smartly handles traffic routing 🔀 to minimize bans and increase success rates 🚀.



🚀🚀 GO TO SCRAPOXY.IO FOR MORE INFORMATION! 🚀🚀

Features

☁️ Datacenter Providers with easy installation ☁️

Scrapoxy supports many datacenter providers like AWS, Azure, or GCP.

It installs a proxy image in each datacenter, enabling the quick launch of proxy instances. Traffic is routed across these instances to provide many IP addresses.

Scrapoxy handles the startup/shutdown of proxy instances to rotate IP addresses effectively.

🌐 Proxy Services 🌐

Scrapoxy supports many proxy services like Rayobyte, IPRoyal, or Zyte.

It connects to these services and uses a variety of parameters, such as country or OS type, to build a diverse pool of proxies.

💻 Hardware devices 💻

Scrapoxy supports many types of 4G proxy farm hardware, like Proxidize.

It uses their APIs to handle IP rotation on 4G networks.

📜 Free Proxy Lists 📜

Scrapoxy supports lists of HTTP/HTTPS proxies and SOCKS4/SOCKS5 proxies.

It takes care of testing their connectivity to aggregate them into the proxy pool.

⏰ Timeout-free ⏰

Scrapoxy only routes traffic to online proxies.

This feature is useful with residential proxies. Sometimes, proxies may be too slow or inactive. Scrapoxy detects these offline nodes and excludes them from the proxy pool.

🔄 Auto-Rotate proxies 🔄

Scrapoxy automatically changes IP addresses at regular intervals.

Scrapers can have thousands of IP addresses without managing proxy rotation.

🏃 Auto-Scale proxies 🏃

Scrapoxy monitors incoming traffic and automatically scales the number of proxies according to your needs.

It also reduces proxy count to minimize your costs.

🍪 Sticky sessions on Browser 🍪

Scrapoxy can keep the same IP address for a scraping session, even for browsers.

It includes an HTTP request/response interception mechanism that injects a session cookie, ensuring the IP address stays the same throughout the browser session.

🚨 Ban management 🚨

Scrapoxy injects the name of the proxy into the HTTP responses.

When a scraper detects that a ban has occurred, it can notify Scrapoxy to remove the proxy from the pool.
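
For illustration, a scraper-side ban handler might look like the following Node.js sketch. The x-cache-proxyname header is the one Scrapoxy injects (it also appears in the issues below); the ban condition and the commander endpoint are placeholders to adapt, not the real API.

import { ProxyAgent } from 'undici';

// Route all requests through the local Scrapoxy endpoint.
const dispatcher = new ProxyAgent('http://localhost:8888');

async function fetchViaScrapoxy(url) {
    const res = await fetch(url, { dispatcher });

    // Scrapoxy injects the proxy's name into every response.
    const proxyName = res.headers.get('x-cache-proxyname');

    // 403 is only an example of a ban signal; use whatever your target returns.
    if (res.status === 403 && proxyName) {
        // Ask Scrapoxy to retire this proxy. The route and payload below are
        // hypothetical; check the commander API documentation for the real ones.
        await fetch('http://localhost:8890/api/proxies/remove', {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            body: JSON.stringify({ name: proxyName }),
        });
    }

    return res;
}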

📡 Traffic interception 📡

Scrapoxy intercepts HTTP requests/responses to modify headers, keeping consistency in your scraping stack. It can add session cookies or specific headers like user-agent.

📊 Traffic monitoring 📊

Scrapoxy measures incoming and outgoing traffic to provide an overview of your scraping session.

It tracks metrics such as the number of requests, active proxy count, requests per proxy, and more.

🌍 Coverage monitoring 🌍

Scrapoxy displays the geographic coverage of your proxies, helping you understand their global distribution.

🚀 Easy-to-use and production-ready 🚀

Scrapoxy is suitable for both beginners and experts.

It can be started in seconds using Docker, or be deployed in a complex, distributed environment with Kubernetes.
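
For example, a single-node start with Docker can look like this. The image name, ports, and environment variables are assumptions based on the quick-start at scrapoxy.io and may differ in your version, so check the documentation before copying them.

# Port 8888 serves proxy traffic, 8890 the web UI (assumed defaults).
docker run -d -p 8888:8888 -p 8890:8890 \
    -e AUTH_LOCAL_USERNAME=admin \
    -e AUTH_LOCAL_PASSWORD=password \
    -e BACKEND_JWT_SECRET=secret1 \
    -e FRONTEND_JWT_SECRET=secret2 \
    fabienvauchelles/scrapoxy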

🔓 Free and Open Source 🔓

And of course, Scrapoxy remains free and open source, under the MIT license.

I simply ask you to give me credit if you redistribute or use it in a project 🙌.

A warm thank-you message is appreciated as well 😃🙏.

Documentation

More information on scrapoxy.io.

Contributors


Want to contribute? Check out the guide!

You can reach me on LinkedIn.

Sponsorship

Scrapoxy is an open-source project. The project is free for users, but it does come with costs for me.

I invest significant time and resources into maintaining and improving this project, covering expenses for hosting, promotion, and more.

If you appreciate the value Scrapoxy provides and wish to support its continued development, discuss new features, access the roadmap, or receive professional support, please consider becoming a sponsor!

Your support would greatly contribute to the project's sustainability and growth.

License

See the MIT License (MIT).

Acknowledgements

I would like to thank all the contributors to the project and the open-source community for their support.

Follow-up

Discord · Docker · NPM

Star History Chart

scrapoxy's People

Contributors

batchris, fabienvauchelles, hotrush, jmaitrehenry, makhtev, mazzly, nextlevelshit, ttilberg


scrapoxy's Issues

Error in proxy request

Hello,

Sometimes we get a 500 response from Scrapoxy with a body containing 'error in proxy request'. When I look at the logs, I find lines similar to '=> Error: socket hang up' / 'Error: connect ECONNREFUSED'. Does this mean there was a problem on the Scrapoxy side, on the proxy side, or on the scraped website? Is it safe to just retry this particular request? Is it possible to include details of the error in the Scrapoxy response body?

Ubuntu 16.04 running on startup

sudo update-rc.d proxyup.sh defaults
insserv: warning: script 'proxyup.sh' missing LSB tags and overrides
insserv: There is a loop between service watchdog and proxyup.sh if stopped
insserv:  loop involving service proxyup.sh at depth 2
insserv:  loop involving service watchdog at depth 1
insserv: Stopping proxyup.sh depends on watchdog and therefore on system facility `$all' which can not be true!
insserv: exiting now without changing boot order!
update-rc.d: error: insserv rejected the script header

So the script does not start when the instance starts.

2017-01-15T09:30:59.395Z - debug: [Instance/65185] checkAlive: false / -
2017-01-15T09:30:59.395Z - debug: [Pinger] ping: hostname=xx.xxx.xxx.xxx / port=3128
2017-01-15T09:30:59.431Z - debug: [Instance/65185] changeAlive: false => false

any ideas?
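
For reference, insserv rejects init scripts that lack an LSB header block. A minimal header for proxyup.sh could look like this (the dependencies and runlevels are illustrative, not taken from the actual script):

### BEGIN INIT INFO
# Provides:          proxyup
# Required-Start:    $network $remote_fs
# Required-Stop:     $network $remote_fs
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: Start the Scrapoxy proxy instance at boot
### END INIT INFO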

Ghost instances are not pinged

Some instances are listed in the API call.

However, they are never pinged.

=> Need to check consistency between the list, the API and the ping mechanism

add AZURE support

Hi,

Really great work! I was wondering, do you have Azure support planned?

Thanks!

Docker support

I've seen the planned tutorial for creating a proxy AMI (image), but what about Docker support?

Error: Cannot update or adjust instances: askedInstances=6

System:

  • Python 2.7 and all necessary dependencies are managed by Anaconda.

Config.json

(settings screenshot attached)

sshKeyName, token, and password are not empty at runtime.

Scrapoxy logs

scrapoxy start config.json -d
2017-01-16T18:44:05.073Z - info: [Main] The selected provider is digitalocean
2017-01-16T18:44:05.090Z - debug: [Main] listen
2017-01-16T18:44:05.095Z - info: [Commander] GUI is available at http://localhost:8889
2017-01-16T18:44:05.097Z - debug: [Manager] start
2017-01-16T18:44:05.099Z - info: Proxy is listening at http://localhost:8888
2017-01-16T18:44:15.097Z - debug: [Manager] checkInstances
2017-01-16T18:44:17.660Z - debug: [Manager] adjustInstances: required:1 / actual:0
2017-01-16T18:44:17.661Z - debug: [Manager] adjustInstances: add 1 instances
2017-01-16T18:44:17.661Z - debug: [ProviderDigitalOcean] createInstances: count=1
2017-01-16T18:44:19.778Z - debug: [ProviderDigitalOcean] createInstances: actualCount=1
<-- GET /scaling
--> GET /scaling 404 9ms 1.23kb
2017-01-16T18:44:25.112Z - debug: [Manager] checkInstances
2017-01-16T18:44:27.253Z - debug: [Manager] adjustInstances: required:6 / actual:0
2017-01-16T18:44:27.253Z - debug: [Manager] adjustInstances: add 6 instances
2017-01-16T18:44:27.254Z - debug: [ProviderDigitalOcean] createInstances: count=6
2017-01-16T18:44:28.072Z - debug: [ProviderDigitalOcean] createInstances: actualCount=2
2017-01-16T18:44:35.113Z - debug: [Manager] checkInstances
2017-01-16T18:44:37.290Z - debug: [Manager] adjustInstances: required:6 / actual:0
2017-01-16T18:44:37.290Z - debug: [Manager] adjustInstances: add 6 instances
2017-01-16T18:44:37.291Z - debug: [ProviderDigitalOcean] createInstances: count=6
2017-01-16T18:44:37.803Z - debug: [ProviderDigitalOcean] createInstances: actualCount=8
2017-01-16T18:44:37.806Z - error: [Manager] Error: Cannot update or adjust instances: askedInstances=6, security=true
2017-01-16T18:44:45.115Z - debug: [Manager] checkInstances
2017-01-16T18:44:45.729Z - debug: [Manager] adjustInstances: required:6 / actual:0
2017-01-16T18:44:45.729Z - debug: [Manager] adjustInstances: add 6 instances
2017-01-16T18:44:45.730Z - debug: [ProviderDigitalOcean] createInstances: count=6
2017-01-16T18:44:46.300Z - debug: [ProviderDigitalOcean] createInstances: actualCount=8
2017-01-16T18:44:46.300Z - error: [Manager] Error: Cannot update or adjust instances: askedInstances=6, security=true

Scrapy logs

[scrapy] INFO: Spider opened
[scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
[scraper] DEBUG: [ScaleMiddleware] Upscale Scrapoxy
[requests.packages.urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): 127.0.0.1
[requests.packages.urllib3.connectionpool] DEBUG: http://127.0.0.1:8889 "GET /scaling HTTP/1.1" 404 1257
[scrapy] ERROR: Error caught on signal handler: <bound method ?.spider_opened of <scrapoxy.downloadmiddlewares.scale.ScaleMiddleware object at 0x0000000004523198>>
Traceback (most recent call last):
File "c:\users..\anaconda3\lib\site-packages\twisted\internet\defer.py", line 149, in maybeDeferred
result = f(*args, **kw)
File "c:\users..\anaconda3\lib\site-packages\pydispatch\robustapply.py", line 55, in robustApply
return receiver(*arguments, **named)
File "c:\users..\anaconda3\lib\site-packages\scrapoxy\downloadmiddlewares\scale.py", line 45, in spider_opened
min_sc, required_sc, max_sc = self._commander.get_scaling()
File "c:\users..\anaconda3\lib\site-packages\scrapoxy\commander.py", line 80, in get_scaling
r.raise_for_status()
File "c:\users..\anaconda3\lib\site-packages\requests\models.py", line 893, in raise_for_status
raise HTTPError(http_error_msg, response=self)
HTTPError: 404 Client Error: Not Found for url: http://127.0.0.1:8889/scaling
[scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
[scraper] WARNING: [WaitMiddleware] Sleeping 120 seconds because no proxy is found: [Master] Error: No running instance found
[scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
..
...
ConnectionError: HTTPConnectionPool(host='127.0.0.1', port=8889): Max retries exceeded with url: /scaling (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x00000000052A7B00>: Failed to establish a new connection: [Errno 10061] No connection could be made because the target machine actively refused it',))
...
...

Everything else is implemented as shown in the documentation.

Vscale - create instance

We are trying to use the Vscale provider, but we have this issue:

2017-02-17T13:35:44.041Z - debug: [Manager] checkInstances
2017-02-17T13:35:44.238Z - debug: [Manager] adjustInstances: required:1 / actual:0
2017-02-17T13:35:44.238Z - debug: [Manager] adjustInstances: add 1 instances
2017-02-17T13:35:44.238Z - debug: [ProviderVscale] createInstances: count=1
2017-02-17T13:35:44.438Z - debug: [ProviderVscale] createInstances: actualCount=0
2017-02-17T13:35:44.438Z - error: [Manager] Error: Cannot update or adjust instances: Error: Cannot find image by name 'undefined'
    at self._api.getAllImages.then (/usr/lib/node_modules/scrapoxy/server/providers/vscale/index.js:193:35)
    at tryCatcher (/usr/lib/node_modules/scrapoxy/node_modules/bluebird/js/release/util.js:16:23)
    at Promise._settlePromiseFromHandler (/usr/lib/node_modules/scrapoxy/node_modules/bluebird/js/release/promise.js:504:31)
    at Promise._settlePromise (/usr/lib/node_modules/scrapoxy/node_modules/bluebird/js/release/promise.js:561:18)
    at Promise._settlePromise0 (/usr/lib/node_modules/scrapoxy/node_modules/bluebird/js/release/promise.js:606:10)
    at Promise._settlePromises (/usr/lib/node_modules/scrapoxy/node_modules/bluebird/js/release/promise.js:685:18)
    at Async._drainQueue (/usr/lib/node_modules/scrapoxy/node_modules/bluebird/js/release/async.js:138:16)
    at Async._drainQueues (/usr/lib/node_modules/scrapoxy/node_modules/bluebird/js/release/async.js:148:10)
    at Immediate.Async.drainQueues (/usr/lib/node_modules/scrapoxy/node_modules/bluebird/js/release/async.js:17:14)
    at runCallback (timers.js:649:20)
    at tryOnImmediate (timers.js:622:5)
    at processImmediate [as _immediateCallback] (timers.js:594:5)

The image is deleted when the IP is blacklisted. You need to create another one.

AWS/EC2: Make a public ami compatible with t2.nano

@fabienvauchelles I found that your AMI is compatible with t1.micro only (which costs $0.02/hour, too pricey). I tried to use it with a t2.nano (which costs $0.0063/hour, much cheaper), but the "Virtualization type" is not compatible (t2.nano requires hvm instead of paravirtual).

Could you make an AMI for t2.nano?

PS: I followed your DigitalOcean guide to create an AMI that works for t2.nano. But I don't know how to make "Scrapoxy use HTTP ping instead of a TCP ping." Please advise.

Use Multiple Providers and Regions

Would it make sense to be able to specify several regions, instead of just one, in the configuration file?

It sounds better to me to have servers distributed around the world, for more availability and a wider variety of IPs. What do you think?

Instance not alive - Scrapoxy unreachable

Hey guys,

My Scrapoxy is not reachable.
I did all the configuration, but:

2017-03-11T11:51:26.350Z - debug: [Instance/41975420] changeAlive: false => false
2017-03-11T11:51:35.100Z - debug: [Manager] checkInstances
2017-03-11T11:51:36.146Z - debug: [Manager] adjustInstances: required:1 / actual:1

My instance is not alive.
I've already read all the other issues on similar subjects, but none of them helped.

I don't know what I should do. I have tested with both Docker and npm, with the same results.

Replace the useragent list by a fresh one

New suggestion:

Firefox / Macos:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:50.0) Gecko/20100101 Firefox/50.0
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:49.0) Gecko/20100101 Firefox/49.0
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:48.0) Gecko/20100101 Firefox/48.0
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:48.0) Gecko/20100101 Firefox/48.0
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:47.0) Gecko/20100101 Firefox/47.0
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:46.0) Gecko/20100101 Firefox/46.0
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:46.0) Gecko/20100101 Firefox/46.0
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:45.0) Gecko/20100101 Firefox/45.0
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11; rv:45.0) Gecko/20100101 Firefox/45.0
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:43.0) Gecko/20100101 Firefox/43.0

Firefox / Linux:

Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0

Firefox / Windows:

Mozilla/5.0 (Windows NT 10.0; rv:50.0) Gecko/20100101 Firefox/50.0
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0
Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:43.0) Gecko/20100101 Firefox/43.0
Mozilla/5.0 (Windows NT 6.3; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0

Chrome / Linux:

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.59 Safari/537.36

Chrome & Edge / Windows:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36
Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36 Edge/13.10586
Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.10240
Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36
Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36

IE / Windows:

Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko
Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko
Mozilla/5.0 (Windows NT 10.0; Trident/7.0; rv:11.0) like Gecko
Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko

What do you think?

Removed instances are still added to alive pool

Hi Fabien!
This morning, despite the fix for issue #36, our Scrapoxy logs were full of connect ETIMEDOUT errors again. It occurs after the manager adjusts instances. It is almost exactly the same issue:

  1. Scaling decreases
  2. Start to remove a pool of instances from the provider
  3. Change the alive status to false and remove the instance from the alive pool
  4. Make a valid ping to the instance, change the alive status to true and re-add the instance to the alive pool
  5. Finish to remove the instance
02:10:06.172	[Manager] adjustInstances: required:10 / actual:10
02:10:14.126	[Manager] checkInstances
02:10:14.800	[Manager] checkInstances
02:10:15.868	[Manager] adjustInstances: required:1 / actual:10
02:10:15.868	[Manager] adjustInstances: remove 9 instances
02:10:15.870	[ProviderOVHCloud] removeInstances: models= 0=4364c674-4beb-4947-a74c-6d5ecf174db7
02:10:24.823	[Instance/4364c674-4beb-4947-a74c-6d5ecf174db7] checkAlive: true / -
02:10:25.555	[Manager] checkInstances: remove: [email protected]:3128
02:10:25.555	[Instance/4364c674-4beb-4947-a74c-6d5ecf174db7] Instance is down. Crash timer starts
02:10:25.555	[Instance/4364c674-4beb-4947-a74c-6d5ecf174db7] changeAlive: true => false
02:10:25.847	[Instance/4364c674-4beb-4947-a74c-6d5ecf174db7] Instance is alive. Remove crash timer
02:10:25.847	[Instance/4364c674-4beb-4947-a74c-6d5ecf174db7] changeAlive: false => true

I think we should use instance.remove() in adjustInstances to use the newly created removing flag:

function adjustInstances() {
  const managedCount = self._managedInstances.size;
  winston.debug('[Manager] adjustInstances: required:%d / actual:%d', self._config.scaling.required, managedCount);
  if (managedCount > self._config.scaling.required) {
    // Too much
    const count = managedCount - self._config.scaling.required;

    winston.debug('[Manager] adjustInstances: remove %d instances', count);

    // Instead of:
    // const models = _(Array.from(self._managedInstances.values()))
    //                    .sample(count)
    //                    .map((instance) => instance.model) // get function
    //                    .filter((model) => !model.locked) // only unlocked instance can be removed
    //                    .value();
    // return self._provider.removeInstances(models);

    // Something like this:
    const instances = _(Array.from(self._managedInstances.values()))
                        .sample(count)
                        .filter((instance) => !instance.model.locked) // only unlocked instance can be removed
                        .value();
    return Promise.map(instances, (instance) => instance.remove());
  }
  else if (managedCount < self._config.scaling.required) {
    /* ... */
  }
}

This solution makes several requests instead of just one, though. What do you think?

Vscale.io provider

I have added a new provider in my fork: vscale.io. If you are interested, I can make a PR (but I need help with the IcoMoon icon font).

Proxy CONNECT aborted on HTTPS redirection

Hello,

Excellent work, but it does not seem to work with HTTP to HTTPS redirection, for example:
http://github.com/fabienvauchelles/scrapoxy

HTTP :

$ curl -i --proxy http://127.0.0.1:8888 http://github.com/fabienvauchelles/scrapoxy
HTTP/1.1 301 Moved Permanently
content-length: 0
location: https://github.com/fabienvauchelles/scrapoxy
connection: close
x-cache-proxyname: i-ae00b817
Date: Tue, 08 Dec 2015 23:33:56 GMT

HTTPS :

$ curl --proxy http://127.0.0.1:8888 https://github.com/fabienvauchelles/scrapoxy
curl: (56) Proxy CONNECT aborted

Regards,

Error running Gulp on Node 7

Running Gulp with Node v7.5.0 raises the following error:

$ node_modules/.bin/gulp test
module.js:472
    throw err;
    ^

Error: Cannot find module 'internal/fs'
    at Function.Module._resolveFilename (module.js:470:15)
    at Function.Module._load (module.js:418:25)
    at Module.require (module.js:498:17)
    at require (internal/module.js:20:19)
    at evalmachine.<anonymous>:18:20
    at Object.<anonymous> (/.../node_modules/vinyl-fs/node_modules/graceful-fs/fs.js:11:1)
    at Module._compile (module.js:571:32)
    at Object.Module._extensions..js (module.js:580:10)
    at Module.load (module.js:488:32)
    at tryModuleLoad (module.js:447:12)

This should be fixed in Gulp 3.9.1, which is actually the version depended upon; however, I suspect there are some outdated dependencies in npm-shrinkwrap. Installing dependencies with Yarn (which ignores npm-shrinkwrap) fixes the issue.
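
A minimal way to apply that workaround, assuming Yarn is installed:

# Reinstall dependencies with Yarn, which ignores npm-shrinkwrap.json.
rm -rf node_modules
yarn install
node_modules/.bin/gulp test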

Using Scrapoxy with Squid

Hello,

I would like to use your solution to manage Squid proxies across several VMs. Your documentation says this is possible. How does it work? What configuration should I put in conf.json, since in this case I have no AWS, OVH, or DigitalOcean provider?

Regards.

Error with instances

(screenshot attached: 2016-10-06, 4:12 PM)

Since yesterday, all newly created instances are dead. I tried stopping and starting Scrapoxy, and also tried killing the instances. I get this error:
2016-10-06T22:57:22.952Z - debug: [Manager] checkInstances
2016-10-06T22:57:23.083Z - debug: [Manager] adjustInstances: required:10 / actual:1
2016-10-06T22:57:23.083Z - debug: [Manager] adjustInstances: add 9 instances
2016-10-06T22:57:23.084Z - debug: [ProviderDigitalOcean] createInstances: count=9
2016-10-06T22:57:23.210Z - debug: [ProviderDigitalOcean] createInstances: actualCount=2
2016-10-06T22:57:24.953Z - debug: [Manager] checkInstances
2016-10-06T22:57:25.306Z - debug: [Instance/28143199] checkAlive: false / -
2016-10-06T22:57:25.306Z - debug: [Pinger] ping: hostname=198.211.104.194 / port=3128
2016-10-06T22:57:25.481Z - error: [Manager] Error: Cannot update or adjust instances: TypeError: Cannot read property 'ip_address' of undefined
at _.map (/usr/lib/node_modules/scrapoxy/server/providers/digitalocean/index.js:63:43)
at arrayMap (/usr/lib/node_modules/scrapoxy/node_modules/lodash/index.js:1406:25)
at Function.map (/usr/lib/node_modules/scrapoxy/node_modules/lodash/index.js:6710:14)
at summarizeInfo (/usr/lib/node_modules/scrapoxy/server/providers/digitalocean/index.js:60:22)
at tryCatcher (/usr/lib/node_modules/scrapoxy/node_modules/bluebird/js/release/util.js:16:23)
at Promise._settlePromiseFromHandler (/usr/lib/node_modules/scrapoxy/node_modules/bluebird/js/release/promise.js:504:31)
at Promise._settlePromise (/usr/lib/node_modules/scrapoxy/node_modules/bluebird/js/release/promise.js:561:18)
at Promise._settlePromise0 (/usr/lib/node_modules/scrapoxy/node_modules/bluebird/js/release/promise.js:606:10)
at Promise._settlePromises (/usr/lib/node_modules/scrapoxy/node_modules/bluebird/js/release/promise.js:685:18)
at Async._drainQueue (/usr/lib/node_modules/scrapoxy/node_modules/bluebird/js/release/async.js:138:16)
at Async._drainQueues (/usr/lib/node_modules/scrapoxy/node_modules/bluebird/js/release/async.js:148:10)
at Immediate.Async.drainQueues (/usr/lib/node_modules/scrapoxy/node_modules/bluebird/js/release/async.js:17:14)
at runCallback (timers.js:574:20)
at tryOnImmediate (timers.js:554:5)
at processImmediate [as _immediateCallback]

Digital Ocean: API destroy errors

Hi team,

Today I got this message from DO:

It seems that a script you have running is requesting multiple "destroy" commands for each droplet, causing event failures. We will need to look into the system allowing it, but I wanted to ask if you would be able to adjust the script to ensure that it only sends one destroy command for a single droplet. It's setting off a few of our monitors ;)

Is this a bug in Scrapoxy?

Please advise. Thanks in advance.

How to manage blacklisted requests?

Hello,

I cannot make scrapoxy work for me. If I run

scrapoxy test http://localhost:8888

I get the error

error: [Test] Error: Cannot get IP address: 407: [Master] Error: No running instance found

However, Scrapoxy scales from 1 to 3 instances right after the command. The GUI looks like this:

(instances screenshot attached)

I have tried both with DigitalOcean and AWS. By the way, trying

curl --proxy http://127.0.0.1:8888 http://api.ipify.org

also yields

[Master] Error: No running instance found

Am I missing something obvious?

Thanks

No running instance or connection refused for a few minutes

Hi,

Fantastic work on this library. We have been using it for quite some time now and it works beautifully 99% of the time.
However, we still sometimes get some weird errors on some requests:

  • 407 response with 'No running instance found' in the body
  • 500 response with 'Error in proxy request: Error: connect ECONNREFUSED' in the body

My guess is that Scrapoxy returns these errors when it is refreshing its internal IP map (once every hour, is that correct?). Do you know how to handle these? Should we sleep and retry after a while?
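
Until the cause is confirmed, a sleep-and-retry wrapper along these lines can absorb the transient responses. This is only a sketch: the status codes match the ones reported above, but the attempt count and delays are arbitrary examples, not recommended values.

import { setTimeout as sleep } from 'node:timers/promises';

// Retry transient Scrapoxy errors: 407 ('No running instance found')
// and 500 ('Error in proxy request').
async function fetchWithRetry(url, options = {}, maxAttempts = 5) {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        const res = await fetch(url, options);
        if (res.status !== 407 && res.status !== 500) {
            return res;
        }
        await sleep(attempt * 10_000); // back off: 10s, 20s, 30s...
    }
    throw new Error(`Still failing after ${maxAttempts} attempts: ${url}`);
}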

OVH unknown status

Hi,

We are using OVH to host our proxies, and it seems that Scrapoxy sometimes encounters an unknown status from the OVH API. The related log line is: 'error: [ProviderOVHCloud] Unknown status: DELETED'. Do you know why this is happening?

Error: Cannot update or adjust instances

Hello, I just installed Scrapoxy on Digital Ocean, but I'm seeing this error message:

2016-11-15T20:20:24.518Z - debug: [Manager] checkInstances
2016-11-15T20:20:25.041Z - debug: [Manager] adjustInstances: required:5 / actual:1
2016-11-15T20:20:25.041Z - debug: [Manager] adjustInstances: add 4 instances
2016-11-15T20:20:25.042Z - debug: [ProviderDigitalOcean] createInstances: count=4
2016-11-15T20:20:25.559Z - debug: [ProviderDigitalOcean] createInstances: actualCount=9
2016-11-15T20:20:25.560Z - error: [Manager] Error: Cannot update or adjust instances: askedInstances=4, security=true

The actualCount=9 part surprised me; in my conf I have a limit of 5, and only 1 droplet was created by Scrapoxy.

Also, if I test it:

$ scrapoxy test http://localhost:8888
2016-11-15T20:21:42.193Z - error: [Test] Error: Cannot get IP address: 407: [Master] Error: No running instance found

A screenshot might help?

(screenshot attached: 2016-11-15, 21:22)

Let me know how I can provide you more feedback about this!

add HEROKU support

For small-scale use, Heroku could be a good provider.

Basic apps are free, and we can get 1,000 hours/month.

Destroy random instance when shrinking

Scrapoxy is brilliant!

Just one improvement I'd love to see. I have a config as follows:

"instance": {
"port": 3128,
"scaling": {
"min": 1,
"max": 4
},
"autorestart": {
"minDelay": 1800000,
"maxDelay": 3600000
}
},

When I push load through Scrapoxy and it shrinks down to 1 instance, it's always the oldest (or first) instance that stays - so once every few weeks I have to go in and destroy it.
Is it possible to randomise which instances get destroyed, so an instance doesn't hang around for a long time?

Thanks,
Adam

Error: request error from target (connect ETIMEDOUT)

Hi @fabienvauchelles,

I got the error below when using Scrapoxy (sorry for censoring some info):

4|scrapoxy | 2016-12-20T03:14:48.412Z - error: [Master] Error: request error from target (GET {{https_url}} on instance i-...@...}): { Error: connect ETIMEDOUT ...:3128
4|scrapoxy |     at Object.exports._errnoException (util.js:1022:11)
4|scrapoxy |     at exports._exceptionWithHostPort (util.js:1045:20)
4|scrapoxy |     at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1087:14)
4|scrapoxy |   code: 'ETIMEDOUT',
4|scrapoxy |   errno: 'ETIMEDOUT',
4|scrapoxy |   syscall: 'connect',
4|scrapoxy |   address: '...',
4|scrapoxy |   port: 3128 }

The stats of scrapoxy at that point:
(screenshot attached: 2016-12-20, 10:17)

Please advise. Thank you so much :)

AWS warning: ports 22 not open

Hi! I'm a newbie on AWS (free plan).
I just found Scrapoxy after being blocked on my scraping task (price monitoring) using Content Grabber.
I have followed the standard tutorial for AWS/EC2, but changed the image region to São Paulo.

I don't know how to perform Step 5 (edit conf.json). How do I do that?
So I clicked Connect to instance to see what happens.
Warning message: "You may not be able to connect to this instance as ports 22 may need to be open in order to be accessible. Your current security groups don't have ports 22 open."

Please, can you help me?
Thanks!

Stuck trying to install Scrapoxy in Digital Ocean

I'm trying to install Scrapoxy on DigitalOcean, as I have more experience with DO than with AWS. So I skipped the Quick Start, which uses DO, and jumped to the DigitalOcean documentation.

I'm now stuck here:

http://docs.scrapoxy.io/en/master/standard/providers/digitalocean/index.html#options-digitalocean

I just created the server, installed the software, and created the image. The next step says I have to edit "conf.json", but where is that file?

I mean, do I have to install a client on my dev machine as well? Or is it a config file on the server?

Error 407 - No running instance found

Hi,

Thanks for your work! Scrapoxy is great!

However, I get a 407 error (No running instance found) every few hours or days. Usually, this error disappears after 10 or 15 minutes, but today I have been stuck on it for 5 hours, so this time I'm able to better understand what's happening.

In my Scrapoxy, I have 14 instances (droplets from DigitalOcean), and all of them are started at the provider (green check in Scrapoxy), but they are all in the dead status in Scrapoxy (sometimes 1 is alive and restarts as usual). All the other instances are dead and stuck in this state without any reboot, although they should reboot every 10 minutes. In my configuration, the minimum number of instances is set to 9, the max is 14, and the current required count is 14. In DigitalOcean, all 14 instances have been live and running for 5 hours.

You can see attached a screenshot of my Scrapoxy dashboard.

(screenshot attached: March 30, 10:36)

Thanks!

Log requests

I'm getting 404 errors with an OVH config:

2016-04-23T14:21:46.567Z - error: [Manager] error: 404

Is there a way of logging the request?

Digital Ocean: Scrapoxy creating instances from different regions of the same provider

Hello Fabien,

I have an issue where the Scrapoxy UI shows instances from different regions on every Scrapoxy installation across different servers. Each is created from a different image (from different datacenters), and each conf.json has the DigitalOcean token for its respective image and datacenter. I tried creating a new droplet, then the image, installed Scrapoxy, and when I started it, it showed instances belonging to a different DigitalOcean token and image. Hope you understood the issue. Let me know what I can do to resolve this.

Thanks!
Chandana

PM2 timeout

Hello,

In your documentation (http://docs.scrapoxy.io/en/master/advanced/startup/index.html?highlight=pm2), you write that PM2 does not properly kill the instances on restart/stop, but we found out that this is related to PM2's "kill_timeout" setting. By default, this setting is too low (1600 ms) to give Scrapoxy enough time to stop its instances before being killed by PM2.

Increasing this timeout (in the PM2 configuration file, for instance) to something like 30 seconds should solve the issue. Maybe you can try it yourself and update your documentation accordingly?
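
For instance, a PM2 ecosystem file along these lines would raise the timeout (the app name and arguments are illustrative):

// ecosystem.config.js: give Scrapoxy time to stop its cloud instances
// before PM2 kills the process (kill_timeout is in milliseconds).
module.exports = {
    apps: [{
        name: 'scrapoxy',
        script: 'scrapoxy',
        args: 'start conf.json -d',
        kill_timeout: 30000, // 30 s instead of PM2's 1600 ms default
    }],
};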

add vultr and/or Scaleway provider

Hi,

I would like Vultr or Scaleway support, because I need French IPs and these providers aren't expensive (unlike OVH), with datacenters in Paris.

I saw that these providers offer snapshots like DigitalOcean, so adding them may be pretty easy. How can I help you add them?

I didn't see in the documentation how to add new providers.
Thanks
