
scrapy-proxies's Introduction

Random proxy middleware for Scrapy (http://scrapy.org/)

Processes Scrapy requests using a random proxy from a list to avoid IP bans and improve crawling speed.

Get your proxy list from sites like http://www.hidemyass.com/ (copy-paste into a text file and reformat to http://host:port format).

Install

The quick way:

pip install scrapy_proxies

Or checkout the source and run

python setup.py install

settings.py

# Retry many times since proxies often fail
RETRY_TIMES = 10
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

# Proxy list containing entries like
# http://host1:port
# http://username:password@host2:port
# http://host3:port
# ...
PROXY_LIST = '/path/to/proxy/list.txt'

# Proxy mode
# 0 = Every request has a different proxy
# 1 = Take only one proxy from the list and assign it to every request
# 2 = Put a custom proxy to use in the settings
PROXY_MODE = 0

# If proxy mode is 2, uncomment this line:
#CUSTOM_PROXY = "http://host1:port"

For older versions of Scrapy (before 1.0.0) you have to use the scrapy.contrib.downloadermiddleware.retry.RetryMiddleware and scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware middlewares instead.

Your spider

In each callback, ensure that the proxy really returned your target page by checking for the site logo or some other significant element. If not, retry the request with dont_filter=True:

if not hxs.select('//get/site/logo'):
    yield Request(url=response.url, dont_filter=True)
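The snippet above uses the legacy hxs selector API. A minimal sketch of the same check with the current response.xpath API (the logo XPath and spider name below are placeholders, not part of this project):

    import scrapy

    class MySpider(scrapy.Spider):
        name = 'example'
        start_urls = ['http://example.com/']

        def parse(self, response):
            # Check for a significant element to confirm the proxy really
            # returned the target page (placeholder XPath for the site logo)
            if not response.xpath('//img[@id="logo"]'):
                # Retry the same URL; dont_filter=True bypasses the duplicate
                # filter so the retried request is not dropped
                yield scrapy.Request(response.url, dont_filter=True)
                return
            # ... normal item extraction goes here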

scrapy-proxies's People

Contributors

aivarsk, alibozorgkhan, aorzh, cbrz, dmikhr, hhaoyan, kangoo13, toniprada, yijingping

scrapy-proxies's Issues

Proxy is set only if proxy_user_pass exists

I looked through your code and found out that the proxy meta value is set only when there is a proxy_user_pass for the proxy record:

def process_request(self, request, spider):
	# Don't overwrite with a random one (server-side state for IP)
	if 'proxy' in request.meta:
		if request.meta["exception"] is False:
			return
	request.meta["exception"] = False
	if len(self.proxies) == 0:
		raise ValueError('All proxies are unusable, cannot proceed')

	if self.mode == ProxyMode.RANDOMIZE_PROXY_EVERY_REQUESTS:
		proxy_address = random.choice(list(self.proxies.keys()))
	else:
		proxy_address = self.chosen_proxy

	proxy_user_pass = self.proxies[proxy_address]

	if proxy_user_pass:
		request.meta['proxy'] = proxy_address
		basic_auth = 'Basic ' + base64.b64encode(proxy_user_pass.encode()).decode()
		request.headers['Proxy-Authorization'] = basic_auth
	else:
		log.debug('Proxy user pass not found')
	log.debug('Using proxy <%s>, %d proxies left' % (
			proxy_address, len(self.proxies)))

Have I missed something?
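For reference, the change this report points at would look roughly like the sketch below (an illustration of the suggested fix, not the maintainer's code): set request.meta['proxy'] unconditionally and attach the Proxy-Authorization header only when credentials exist.

    proxy_user_pass = self.proxies[proxy_address]
    # Always route the request through the chosen proxy
    request.meta['proxy'] = proxy_address
    if proxy_user_pass:
        # Only add the auth header when the list entry had credentials
        basic_auth = 'Basic ' + base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = basic_auth
    else:
        log.debug('Proxy user pass not found')
    log.debug('Using proxy <%s>, %d proxies left' % (
        proxy_address, len(self.proxies)))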

How to check that a proxy is really being used?

In the process_request function the proxy is added to the request only if it has a proxy_user_pass; otherwise it only logs that the proxy is being used and how many are left. Does that mean a proxy like https://176.37.14.252:8080 does not work?

This is the function:

def process_request(self, request, spider):
     # Don't overwrite with a random one (server-side state for IP)
     if 'proxy' in request.meta:
         if request.meta["exception"] is False:
             return
     request.meta["exception"] = False
     if len(self.proxies) == 0:
         raise ValueError('All proxies are unusable, cannot proceed')

     if self.mode == Mode.RANDOMIZE_PROXY_EVERY_REQUESTS:
         proxy_address = random.choice(list(self.proxies.keys()))
     else:
         proxy_address = self.chosen_proxy

     proxy_user_pass = self.proxies[proxy_address]

     if proxy_user_pass:
         request.meta['proxy'] = proxy_address
         basic_auth = 'Basic ' + base64.b64encode(proxy_user_pass.encode()).decode()
         request.headers['Proxy-Authorization'] = basic_auth
     else:
         log.debug('Proxy user pass not found')
     log.debug('Using proxy <%s>, %d proxies left' % (
             proxy_address, len(self.proxies)))

Why isn't the bad proxy removed when a 403 happens?

2017-07-12 14:35:33 [scrapy.proxies] DEBUG: Using proxy http://208.92.94.191:1080, 91 proxies left
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)
/search/category/2/10/g251p6?aid=79417082%2C20944119%2C67545588%2C512124%2C4665606%2C2517868%2C68124250%2C77336676%2C19331058%2C91955011%2C52802565%2C92076417&cpt=79417082%2C20944119%2C67545588%2C512124%2C4665606%2C2517868%2C68124250%2C77336676%2C19331058%2C91955011%2C52802565%2C92076417&tc=1 ==================
2017-07-12 14:35:34 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.dianping.com/shop/70170698> (failed 1 times): 403 Forbidden
2017-07-12 14:35:34 [scrapy.proxies] DEBUG: Using proxy http://110.244.119.139:80, 91 proxies left
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)
2017-07-12 14:35:35 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.dianping.com/shop/507618> (failed 1 times): 403 Forbidden
2017-07-12 14:35:35 [scrapy.proxies] DEBUG: Using proxy http://125.89.121.179:808, 91 proxies left
Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0

https proxy issue

I have an issue where proxies are not being used when accessing https:// websites (my actual IP is used instead).

I've verified that my proxies do support https:// (setting the HTTPS_PROXY environment variable to the proxy address works).

Setting the proxies in my proxy_list to http:// or https:// does not make a difference.

Retry won't pick a new proxy.

Hi,
I use a proxy list to run my spider. However, it fails to pick a new proxy when a connection failure happens.

2016-09-20 17:48:25 [scrapy] DEBUG: Using proxy http://xxx.160.162.95:8080, 3 proxies left
2016-09-20 17:48:27 [scrapy] INFO: Removing failed proxy http://xxx.160.162.95:8080, 2 proxies left
2016-09-20 17:48:27 [scrapy] DEBUG: Retrying <GET http://jsonip.com/> (failed 1 times): User timeout caused connection failure: Getting http://jsonip.com/ took longer than 2.0 seconds..
2016-09-20 17:48:29 [scrapy] INFO: Removing failed proxy http://xxx.160.162.95:8080, 2 proxies left
2016-09-20 17:48:29 [scrapy] DEBUG: Retrying <GET http://jsonip.com/> (failed 2 times): User timeout caused connection failure: Getting http://jsonip.com/ took longer than 2.0 seconds..
2016-09-20 17:48:31 [scrapy] INFO: Removing failed proxy http://xxx.160.162.95:8080, 2 proxies left
2016-09-20 17:48:31 [scrapy] DEBUG: Gave up retrying <GET http://jsonip.com/> (failed 3 times): User timeout caused connection failure: Getting http://jsonip.com/ took longer than 2.0 seconds..

Please help to fix this problem.
thanks a lot

Proxy Error // Start and Stop Time for Requests

When I attempt to read the proxy I get a "KeyError: 'proxy'". Previously, I was able to get the IP address prior to using the proxies. Is there any way to get the proxy address that is used?

def parse_item(self, response):
    item = {}
    item['url'] = response.url
    item['download_latency'] = response.request.meta['download_latency']
    item['proxy'] = response.request.meta['proxy']

Separate question from the previous: is there any way to get the start and stop time for a request? I'm trying to get a better understanding of CONCURRENT_REQUESTS and how best to maximize requests per second.
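On the KeyError, a defensive lookup with meta.get() avoids the crash for requests where the middleware did not set a proxy; and download_latency already records how long each download took, which covers the per-request timing. A minimal sketch (field names are illustrative):

    def parse_item(self, response):
        item = {}
        item['url'] = response.url
        # download_latency is the time (in seconds) Scrapy spent fetching
        # this response, i.e. the per-request duration
        item['download_latency'] = response.request.meta.get('download_latency')
        # .get() returns None instead of raising KeyError when the request
        # went out without a proxy set in its meta
        item['proxy'] = response.request.meta.get('proxy')
        yield item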

process exception type

process_exception(request, exception, spider)
Scrapy calls process_exception() when a download handler or a process_request() (from a downloader middleware) raises an exception (including an IgnoreRequest exception)

ImportError: No module named scrapy_proxies

I'm getting this error when I run:

Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 1299, in _inlineCallbacks
result = g.send(result)
File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 90, in crawl
six.reraise(*exc_info)
File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 72, in crawl
self.engine = self._create_engine()
File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 97, in _create_engine
return ExecutionEngine(self, lambda _: self.stop())
File "/Library/Python/2.7/site-packages/scrapy/core/engine.py", line 69, in init
self.downloader = downloader_cls(crawler)
File "/Library/Python/2.7/site-packages/scrapy/core/downloader/init.py", line 88, in init
self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
File "/Library/Python/2.7/site-packages/scrapy/middleware.py", line 58, in from_crawler
return cls.from_settings(crawler.settings, crawler)
File "/Library/Python/2.7/site-packages/scrapy/middleware.py", line 34, in from_settings
mwcls = load_object(clspath)
File "/Library/Python/2.7/site-packages/scrapy/utils/misc.py", line 44, in load_object
mod = import_module(module)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/importlib/init.py", line 37, in import_module
import(name)
ImportError: No module named scrapy_proxies

ValueError: All proxies are unusable, cannot proceed

I'm getting this error:

ValueError: All proxies are unusable, cannot proceed

2017-05-13 14:09:02 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapy_bets)
2017-05-13 14:09:02 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'scrapy_bets.spiders', 'FEED_URI': 'matches.json', 'SPIDER_MODULES': ['scrapy_bets.spiders'], 'RETRY_TIMES': 10, 'BOT_NAME': 'scrapy_bets', 'RETRY_HTTP_CODES': [500, 503, 504, 400, 403, 404, 408], 'FEED_FORMAT': 'json'}
2017-05-13 14:09:02 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2017-05-13 14:09:02 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy_proxies.RandomProxy',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-05-13 14:09:02 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-05-13 14:09:02 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-05-13 14:09:02 [scrapy.core.engine] INFO: Spider opened
2017-05-13 14:09:02 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-05-13 14:09:02 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-05-13 14:09:02 [scrapy.core.scraper] ERROR: Error downloading <GET http://url_to_parse>
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1301, in _inlineCallbacks
result = g.send(result)
File "/usr/local/lib/python2.7/site-packages/scrapy/core/downloader/middleware.py", line 37, in process_request
response = yield method(request=request, spider=spider)
File "/usr/local/lib/python2.7/site-packages/scrapy_proxies/randomproxy.py", line 63, in process_request
raise ValueError('All proxies are unusable, cannot proceed')
ValueError: All proxies are unusable, cannot proceed
2017-05-13 14:09:02 [scrapy.core.scraper] ERROR: Error downloading <GET http://url_to_parse>
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1301, in _inlineCallbacks
result = g.send(result)
File "/usr/local/lib/python2.7/site-packages/scrapy/core/downloader/middleware.py", line 37, in process_request
response = yield method(request=request, spider=spider)
File "/usr/local/lib/python2.7/site-packages/scrapy_proxies/randomproxy.py", line 63, in process_request
raise ValueError('All proxies are unusable, cannot proceed')
ValueError: All proxies are unusable, cannot proceed
2017-05-13 14:09:02 [scrapy.core.engine] INFO: Closing spider (finished)
2017-05-13 14:09:02 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 2,
'downloader/exception_type_count/exceptions.ValueError': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 5, 13, 13, 9, 2, 915138),
'log_count/DEBUG': 1,
'log_count/ERROR': 2,
'log_count/INFO': 7,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2017, 5, 13, 13, 9, 2, 694730)}
2017-05-13 14:09:02 [scrapy.core.engine] INFO: Spider closed (finished)

'use_real_when_empty': False

Hello, on Reposhub I found this setting mentioned for the project:
'use_real_when_empty': False,
Does it work? I haven't found such a function anywhere in the code.

Error

2018-07-26 10:26:02 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-07-26 10:26:02 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-07-26 10:26:02 [scrapy.proxies] DEBUG: Proxy user pass not found
2018-07-26 10:26:02 [scrapy.proxies] DEBUG: Using proxy https://185.93.3.70:8080, 1 proxies left
2018-07-26 10:26:03 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://piknu.com/u/isabel_sanzz/similar> (failed 1 times): 403 Forbidden
2018-07-26 10:26:03 [scrapy.proxies] DEBUG: Proxy user pass not found
2018-07-26 10:26:03 [scrapy.proxies] DEBUG: Using proxy https://185.93.3.70:8080, 1 proxies left

There is a problem when the character "@" appears in the password

There is a problem when the character "@" appears in the password; maybe we should make the regex pattern more compatible? :) Here is my solution:

parts = re.match('(\w+://)([^:]+?:.+@)?(.+)', line.strip())

instead of

parts = re.match('(\w+://)([^:]+?:[^@]+?@)?(.+)', line.strip())
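A quick check of the two patterns on a credential containing "@" (the proxy value below is made up) shows the difference:

    import re

    line = 'http://user:p@ssword@203.0.113.10:8080'  # made-up entry with '@' in the password

    # Original pattern: the credential part stops at the first '@',
    # so the password is split in the wrong place
    old = re.match(r'(\w+://)([^:]+?:[^@]+?@)?(.+)', line)
    # Proposed pattern: '.+' is greedy, so the credential part runs up to
    # the last '@' before host:port
    new = re.match(r'(\w+://)([^:]+?:.+@)?(.+)', line)

    print(old.group(3))  # 'ssword@203.0.113.10:8080' -- wrong host part
    print(new.group(3))  # '203.0.113.10:8080'        -- correct host part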

When the file has an empty line this code raises an error

When the file has an empty line, this code raises an error:
if parts.group(2): raises AttributeError: 'NoneType' object has no attribute 'group'
So, before calling group() we can check parts first, like this:

if parts:
    if parts.group(2):
    ...
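Equivalently, blank (and otherwise unparseable) lines can be skipped before group() is ever called; a minimal sketch of loading the list this way (the path is illustrative):

    import re

    with open('/path/to/proxy/list.txt') as fin:
        for line in fin:
            line = line.strip()
            if not line:
                continue  # skip empty lines so re.match is never given them
            parts = re.match(r'(\w+://)([^:]+?:.+@)?(.+)', line)
            if parts is None:
                continue  # also skip lines that are not in proxy format
            # ... build the proxy entry from parts.group(1), (2) and (3)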

TypeError: memoryview: a bytes-like object is required, not 'str'

Hi, I am getting the error below when using the DOWNLOADER_MIDDLEWARES indicated in the README (I added a proxy list, etc.). I read a bunch of threads on SO but couldn't fix my issue.

Appreciate any help
thanks

Traceback (most recent call last):
File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/base64.py", line 517, in _input_type_check
m = memoryview(s)
TypeError: memoryview: a bytes-like object is required, not 'str'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
result = g.send(result)
File "/usr/local/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 37, in process_request
response = yield method(request=request, spider=spider)
File "/usr/local/lib/python3.6/site-packages/scrapy_proxies/randomproxy.py", line 70, in process_request
basic_auth = 'Basic ' + base64.encodestring(proxy_user_pass)
File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/base64.py", line 547, in encodestring
return encodebytes(s)
File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/base64.py", line 534, in encodebytes
_input_type_check(s)
File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/base64.py", line 520, in _input_type_check
raise TypeError(msg) from err
TypeError: expected bytes-like object, not str
2017-10-16 23:19:20 [scrapy.core.engine] INFO: Closing spider (finished)
2017-10-16 23:19:20 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/builtins.TypeError': 1,
'finish_reason': 'finished
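The traceback shows base64.encodestring() being called with a str on Python 3, where it only accepts bytes (it was deprecated and later removed). The bytes-safe form, which the newer code quoted in other issues already uses, is:

    import base64

    proxy_user_pass = 'user:password'  # illustrative credentials
    # encode() the str to bytes for b64encode, then decode() the result back to str
    basic_auth = 'Basic ' + base64.b64encode(proxy_user_pass.encode()).decode()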

Not working with scrapy-splash

I'm using scrapy-splash to crawl an AJAX site, and when using scrapy-proxies it seems that the request is not sent through the proxy; the proxy is not used at all.

Proxy list can't be loaded on Scrapy Cloud

I tried several different ways.

I added the file "proxylist.txt" to the same folder as the project's settings; in addition I uploaded it to "https://dl.dropboxusercontent.com/s/esdm19mnvz2yguf/proxylist.txt".

I substituted the name in the setting:
PROXY_LIST = 'https://dl.dropboxusercontent.com/s/esdm19mnvz2yguf/proxylist.txt'
or
PROXY_LIST = 'proxylist.txt'
or
PROXY_LIST = '/proxylist.txt'
PROXY_LIST = '../proxylist.txt'

If I use PROXY_LIST = 'proxylist.txt' on my PC it works like a charm, but not once I load it into Scrapy Cloud.

Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1299, in _inlineCallbacks
result = g.send(result)
File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 90, in crawl
six.reraise(*exc_info)
File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 72, in crawl
self.engine = self._create_engine()
File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 97, in _create_engine
return ExecutionEngine(self, lambda _: self.stop())
File "/usr/local/lib/python2.7/site-packages/scrapy/core/engine.py", line 69, in init
self.downloader = downloader_cls(crawler)
File "/usr/local/lib/python2.7/site-packages/scrapy/core/downloader/init.py", line 88, in init
self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
File "/usr/local/lib/python2.7/site-packages/scrapy/middleware.py", line 58, in from_crawler
return cls.from_settings(crawler.settings, crawler)
File "/usr/local/lib/python2.7/site-packages/scrapy/middleware.py", line 36, in from_settings
mw = mwcls.from_crawler(crawler)
File "/app/python/lib/python2.7/site-packages/scrapy_proxies/randomproxy.py", line 55, in from_crawler
return cls(crawler.settings)
File "/app/python/lib/python2.7/site-packages/scrapy_proxies/randomproxy.py", line 35, in init
fin = open(self.proxy_list)
IOError: [Errno 2] No such file or directory: '../proxylist.txt'

Please, I need some help.

Proxy file name as an argument

Can I pass the name of the proxy file as a variable to Scrapy?
That way, if I'm running multiple crawlers at the same time, I would be able to use a different list of proxies for each.

Thank you
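One possible approach, assuming the middleware reads PROXY_LIST from the crawler settings as in the settings.py example above: Scrapy's -s command-line option overrides a setting for a single run, so each crawler can be pointed at its own file (paths and spider names below are illustrative):

    scrapy crawl first_spider -s PROXY_LIST=/path/to/first_list.txt
    scrapy crawl second_spider -s PROXY_LIST=/path/to/second_list.txt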

Properly formatting ProxyList.txt file

Hey there,

Just looking for some basic info. I'm trying to figure out how to properly build my ProxyList.txt file. I've got the IP addresses from HMA Pro, but I'm not sure how to locate the port that goes at the end. I've tried searching Google for how to find the ports but am still not sure. Is there another free service I could use to find the information I need (IP address and port)?

Thanks a ton

I am confused about the non-username-password proxy logic

if proxy_user_pass:
    request.meta['proxy'] = proxy_address
    basic_auth = 'Basic ' + base64.b64encode(proxy_user_pass.encode()).decode()
    request.headers['Proxy-Authorization'] = basic_auth
else:
    log.debug('Proxy user pass not found')
log.debug('Using proxy <%s>, %d proxies left' % (
    proxy_address, len(self.proxies)))

I am very confused here as a noob Python developer. From this part of the logic in the randomproxy file, it seems like the proxy is assigned to the request only if the entry in list.txt is in the http://username:password@host2:port format; otherwise it does nothing but log a debug message...

What am I missing here?

Response never received errors

Does anyone else experience timeout errors, specifically immediately after redirects?

I've only set this up today, but specifically https://www.game.co.uk/en/hardware/xbox-series-x/?contentOnly=&inStockOnly=true&listerOnly=&pageSize=100

I can fetch it OK with scrapy fetch, but if I try to use a spider that crawls the URL, I hit a 302 redirect and my crawl completely errors out from that point with immediate "response never received" failures. It's not long timeouts; it's literally erroring immediately.

Please could somebody help me? I'm fairly new to this and I have no idea what the cause may be.

I'm using a pool of 10 HTTP proxies on port 80.

How to check failure manually?

Is there any way to check for failure through something other than the HTTP status code?
Maybe based on the response body, headers, or something else?

Getting proxy and scraping in one Scrapy project

Hi, Aivars!
I use your random proxy middleware for Scrapy, scrapy_proxies. It works fine, thank you a lot!

At first, I get list.txt (the list of proxies) by scraping a free-proxy site (without proxy rotating).
Then I scrape another site (with scrapy_proxies).
When I run these as two different Scrapy projects, it works well.

I tried to run them together in one Scrapy project, but unfortunately it doesn't work, probably because in that case it tries to use list.txt for proxy rotating while it is still empty at the moment of the request to the free-proxy site.
Is there a way around this?

Thank you

Verify slow crawling

I gave it a list of about 300 proxies and set CONCURRENT_REQUESTS = 64. Still, crawling seems very slow (about 1 page every few seconds on average), much slower than not using any proxy at all. Of course DOWNLOAD_DELAY is low.

Looking into it, it seems that people should usually also increase CONCURRENT_REQUESTS_PER_DOMAIN in these cases (i.e. with a list of many possibly bad proxies), but even then it's still pretty slow.

Change proxy on http code 429 and dont die

Hi, is it possible to change the proxy on HTTP code 429?

If I get a 429 error, I want to change to another proxy from the list.

So I want to run PROXY_MODE = 1, but if I get a 429, I want to switch to a new proxy.
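One possible starting point (just a sketch; whether the proxy actually rotates on the retry depends on how the middleware handles request.meta, as other issues here note) is to add 429 to RETRY_HTTP_CODES in settings.py so those responses are retried at all:

    RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408, 429]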

How to force remove proxy ??

This is not an issue.
Sometimes, when we open a URL we can get an HTTP 200 response,
but it gives 0 results... or, I could say, I got banned from that website.

Is there any way to force remove the proxy item?

Thank you! Any help accepted :)

Dynamic Proxy restart

Is it possible to restart the proxy list if it gets to 0? I have dynamic proxies that refresh every 15 minutes, so I want Scrapy to reload the list if len(self.proxies) == 0.

Thanks!

Passing a proxy via meta in start_requests throws KeyError: 'exception'

I see this is caused by line 83:

if 'proxy' in request.meta:
    if request.meta["exception"] is False:
        return

If we have used a proxy in the start_requests function, then this issue arises, which makes sense because 'exception' is not defined in meta at that point for our first request.

I guess most of us either use a random proxy or a custom proxy, so no one ever bothered about it.
I think line 83 is important because it enables changing proxies on each retry or after an exception.

def start_requests(self):
    yield scrapy.Request('http://quotes.toscrape.com/', callback=self.parse, meta={'proxy': 'http://xxxx:xxxx@xxxx:xxxx'})

Also, to change the proxy on retry, comment this out in process_exception (see #15):

       if 'proxy' not in request.meta:
             return
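A possible workaround without touching the middleware, based on the check quoted above (this is just a sketch, not an official fix): pre-set the 'exception' key whenever a proxy is passed through meta in start_requests, so the lookup never raises:

    def start_requests(self):
        # 'exception': False satisfies the middleware's check and keeps the
        # manually supplied proxy for this first request
        yield scrapy.Request(
            'http://quotes.toscrape.com/',
            callback=self.parse,
            meta={'proxy': 'http://xxxx:xxxx@xxxx:xxxx', 'exception': False},
        )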
