
scrapy-proxies's Introduction

Random proxy middleware for Scrapy (http://scrapy.org/)

Processes Scrapy requests using a random proxy from a list to avoid IP bans and improve crawling speed.

Get your proxy list from sites like http://www.hidemyass.com/ (copy-paste into a text file and reformat to http://host:port format).

Install

The quick way:

pip install scrapy_proxies

Or checkout the source and run

python setup.py install

settings.py

# Retry many times since proxies often fail
RETRY_TIMES = 10
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

# Proxy list containing entries like
# http://host1:port
# http://username:password@host2:port
# http://host3:port
# ...
PROXY_LIST = '/path/to/proxy/list.txt'

# Proxy mode
# 0 = Every request has a different proxy
# 1 = Take only one proxy from the list and assign it to every request
# 2 = Put a custom proxy to use in the settings
PROXY_MODE = 0

# If proxy mode is 2, uncomment this line:
#CUSTOM_PROXY = "http://host1:port"

For older versions of Scrapy (before 1.0.0) you have to use the scrapy.contrib.downloadermiddleware.retry.RetryMiddleware and scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware middlewares instead.

Your spider

In each callback, ensure that the proxy really returned your target page by checking for the site logo or some other significant element. If not, retry the request with dont_filter=True:

if not hxs.select('//get/site/logo'):
    yield Request(url=response.url, dont_filter=True)
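The snippet above uses the legacy hxs selector API. A minimal sketch of the same check with the current response.xpath API (the logo XPath and spider name below are placeholders, not part of this project):

    import scrapy

    class MySpider(scrapy.Spider):
        name = 'example'
        start_urls = ['http://example.com/']

        def parse(self, response):
            # Check for a significant element to confirm the proxy really
            # returned the target page (placeholder XPath for the site logo)
            if not response.xpath('//img[@id="logo"]'):
                # Retry the same URL; dont_filter=True bypasses the duplicate
                # filter so the retried request is not dropped
                yield scrapy.Request(response.url, dont_filter=True)
                return
            # ... normal item extraction goes here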

scrapy-proxies's People

Contributors

aivarsk, alibozorgkhan, aorzh, cbrz, dmikhr, hhaoyan, kangoo13, toniprada, yijingping

scrapy-proxies's Issues

Proxy is set only if proxy_user_pass exists

I looked through your code and found out that the proxy meta value is set only when there is a proxy_user_pass for the proxy record:

def process_request(self, request, spider):
	# Don't overwrite with a random one (server-side state for IP)
	if 'proxy' in request.meta:
		if request.meta["exception"] is False:
			return
	request.meta["exception"] = False
	if len(self.proxies) == 0:
		raise ValueError('All proxies are unusable, cannot proceed')

	if self.mode == ProxyMode.RANDOMIZE_PROXY_EVERY_REQUESTS:
		proxy_address = random.choice(list(self.proxies.keys()))
	else:
		proxy_address = self.chosen_proxy

	proxy_user_pass = self.proxies[proxy_address]

	if proxy_user_pass:
		request.meta['proxy'] = proxy_address
		basic_auth = 'Basic ' + base64.b64encode(proxy_user_pass.encode()).decode()
		request.headers['Proxy-Authorization'] = basic_auth
	else:
		log.debug('Proxy user pass not found')
	log.debug('Using proxy <%s>, %d proxies left' % (
			proxy_address, len(self.proxies)))

Have I missed something?
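For reference, the change this report points at would look roughly like the sketch below (an illustration of the suggested fix, not the maintainer's code): set request.meta['proxy'] unconditionally and attach the Proxy-Authorization header only when credentials exist.

    proxy_user_pass = self.proxies[proxy_address]
    # Always route the request through the chosen proxy
    request.meta['proxy'] = proxy_address
    if proxy_user_pass:
        # Only add the auth header when the list entry had credentials
        basic_auth = 'Basic ' + base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = basic_auth
    else:
        log.debug('Proxy user pass not found')
    log.debug('Using proxy <%s>, %d proxies left' % (
        proxy_address, len(self.proxies)))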

How to check that a proxy is really being used?

In the process_request function the proxy is added to the request only if it has a proxy_user_pass; otherwise it only logs that the proxy is being used and how many are left. Does that mean a proxy like https://176.37.14.252:8080 does not work?

This is the function:

def process_request(self, request, spider):
     # Don't overwrite with a random one (server-side state for IP)
     if 'proxy' in request.meta:
         if request.meta["exception"] is False:
             return
     request.meta["exception"] = False
     if len(self.proxies) == 0:
         raise ValueError('All proxies are unusable, cannot proceed')

     if self.mode == Mode.RANDOMIZE_PROXY_EVERY_REQUESTS:
         proxy_address = random.choice(list(self.proxies.keys()))
     else:
         proxy_address = self.chosen_proxy

     proxy_user_pass = self.proxies[proxy_address]

     if proxy_user_pass:
         request.meta['proxy'] = proxy_address
         basic_auth = 'Basic ' + base64.b64encode(proxy_user_pass.encode()).decode()
         request.headers['Proxy-Authorization'] = basic_auth
     else:
         log.debug('Proxy user pass not found')
     log.debug('Using proxy <%s>, %d proxies left' % (
             proxy_address, len(self.proxies)))

Why isn't the bad proxy removed when a 403 happens?

2017-07-12 14:35:33 [scrapy.proxies] DEBUG: Using proxy http://208.92.94.191:1080, 91 proxies left
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)
/search/category/2/10/g251p6?aid=79417082%2C20944119%2C67545588%2C512124%2C4665606%2C2517868%2C68124250%2C77336676%2C19331058%2C91955011%2C52802565%2C92076417&cpt=79417082%2C20944119%2C67545588%2C512124%2C4665606%2C2517868%2C68124250%2C77336676%2C19331058%2C91955011%2C52802565%2C92076417&tc=1 ==================
2017-07-12 14:35:34 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.dianping.com/shop/70170698> (failed 1 times): 403 Forbidden
2017-07-12 14:35:34 [scrapy.proxies] DEBUG: Using proxy http://110.244.119.139:80, 91 proxies left
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)
2017-07-12 14:35:35 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.dianping.com/shop/507618> (failed 1 times): 403 Forbidden
2017-07-12 14:35:35 [scrapy.proxies] DEBUG: Using proxy http://125.89.121.179:808, 91 proxies left
Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0

https proxy issue

I have an issue where proxies are not being used when accessing https:// websites (my actual IP is used instead).

I've verified that my proxies do support https:// (setting the HTTPS_PROXY environment variable to the proxy address works).

Setting the proxies in my proxy_list to http:// or https:// does not make a difference.

Retry won't pick a new proxy.

Hi,
I use a proxy list to run my spider. However, it fails to pick a new proxy when a connection failure happens.

2016-09-20 17:48:25 [scrapy] DEBUG: Using proxy http://xxx.160.162.95:8080, 3 proxies left
2016-09-20 17:48:27 [scrapy] INFO: Removing failed proxy http://xxx.160.162.95:8080, 2 proxies left
2016-09-20 17:48:27 [scrapy] DEBUG: Retrying <GET http://jsonip.com/> (failed 1 times): User timeout caused connection failure: Getting http://jsonip.com/ took longer than 2.0 seconds..
2016-09-20 17:48:29 [scrapy] INFO: Removing failed proxy http://xxx.160.162.95:8080, 2 proxies left
2016-09-20 17:48:29 [scrapy] DEBUG: Retrying <GET http://jsonip.com/> (failed 2 times): User timeout caused connection failure: Getting http://jsonip.com/ took longer than 2.0 seconds..
2016-09-20 17:48:31 [scrapy] INFO: Removing failed proxy http://xxx.160.162.95:8080, 2 proxies left
2016-09-20 17:48:31 [scrapy] DEBUG: Gave up retrying <GET http://jsonip.com/> (failed 3 times): User timeout caused connection failure: Getting http://jsonip.com/ took longer than 2.0 seconds..

Please help to fix this problem.
thanks a lot

Proxy Error // Start and Stop Time for Requests

When I attempt to read the proxy I get a "KeyError: 'proxy'". Previously, I was able to get the IP address prior to using the proxies. Is there any way to get the proxy address that is used?

def parse_item(self, response):
    item = {}
    item['url'] = response.url
    item['download_latency'] = response.request.meta['download_latency']
    item['proxy'] = response.request.meta['proxy']

Separate question from the previous: is there any way to get the start and stop time for a request? I'm trying to get a better understanding of CONCURRENT_REQUESTS and how best to maximize requests per second.
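On the KeyError, a defensive lookup with meta.get() avoids the crash for requests where the middleware did not set a proxy; and download_latency already records how long each download took, which covers the per-request timing. A minimal sketch (field names are illustrative):

    def parse_item(self, response):
        item = {}
        item['url'] = response.url
        # download_latency is the time (in seconds) Scrapy spent fetching
        # this response, i.e. the per-request duration
        item['download_latency'] = response.request.meta.get('download_latency')
        # .get() returns None instead of raising KeyError when the request
        # went out without a proxy set in its meta
        item['proxy'] = response.request.meta.get('proxy')
        yield item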

process exception type

process_exception(request, exception, spider)
Scrapy calls process_exception() when a download handler or a process_request() (from a downloader middleware) raises an exception (including an IgnoreRequest exception)

ImportError: No module named scrapy_proxies

I'm getting this error when I run:

Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 1299, in _inlineCallbacks
result = g.send(result)
File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 90, in crawl
six.reraise(*exc_info)
File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 72, in crawl
self.engine = self._create_engine()
File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 97, in _create_engine
return ExecutionEngine(self, lambda _: self.stop())
File "/Library/Python/2.7/site-packages/scrapy/core/engine.py", line 69, in init
self.downloader = downloader_cls(crawler)
File "/Library/Python/2.7/site-packages/scrapy/core/downloader/init.py", line 88, in init
self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
File "/Library/Python/2.7/site-packages/scrapy/middleware.py", line 58, in from_crawler
return cls.from_settings(crawler.settings, crawler)
File "/Library/Python/2.7/site-packages/scrapy/middleware.py", line 34, in from_settings
mwcls = load_object(clspath)
File "/Library/Python/2.7/site-packages/scrapy/utils/misc.py", line 44, in load_object
mod = import_module(module)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/importlib/init.py", line 37, in import_module
import(name)
ImportError: No module named scrapy_proxies

ValueError: All proxies are unusable, cannot proceed

I'm getting this error:

ValueError: All proxies are unusable, cannot proceed

2017-05-13 14:09:02 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapy_bets)
2017-05-13 14:09:02 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'scrapy_bets.spiders', 'FEED_URI': 'matches.json', 'SPIDER_MODULES': ['scrapy_bets.spiders'], 'RETRY_TIMES': 10, 'BOT_NAME': 'scrapy_bets', 'RETRY_HTTP_CODES': [500, 503, 504, 400, 403, 404, 408], 'FEED_FORMAT': 'json'}
2017-05-13 14:09:02 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2017-05-13 14:09:02 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy_proxies.RandomProxy',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-05-13 14:09:02 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-05-13 14:09:02 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-05-13 14:09:02 [scrapy.core.engine] INFO: Spider opened
2017-05-13 14:09:02 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-05-13 14:09:02 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-05-13 14:09:02 [scrapy.core.scraper] ERROR: Error downloading <GET http://url_to_parse>
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1301, in _inlineCallbacks
result = g.send(result)
File "/usr/local/lib/python2.7/site-packages/scrapy/core/downloader/middleware.py", line 37, in process_request
response = yield method(request=request, spider=spider)
File "/usr/local/lib/python2.7/site-packages/scrapy_proxies/randomproxy.py", line 63, in process_request
raise ValueError('All proxies are unusable, cannot proceed')
ValueError: All proxies are unusable, cannot proceed
2017-05-13 14:09:02 [scrapy.core.scraper] ERROR: Error downloading <GET http://url_to_parse>
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1301, in _inlineCallbacks
result = g.send(result)
File "/usr/local/lib/python2.7/site-packages/scrapy/core/downloader/middleware.py", line 37, in process_request
response = yield method(request=request, spider=spider)
File "/usr/local/lib/python2.7/site-packages/scrapy_proxies/randomproxy.py", line 63, in process_request
raise ValueError('All proxies are unusable, cannot proceed')
ValueError: All proxies are unusable, cannot proceed
2017-05-13 14:09:02 [scrapy.core.engine] INFO: Closing spider (finished)
2017-05-13 14:09:02 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 2,
'downloader/exception_type_count/exceptions.ValueError': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 5, 13, 13, 9, 2, 915138),
'log_count/DEBUG': 1,
'log_count/ERROR': 2,
'log_count/INFO': 7,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2017, 5, 13, 13, 9, 2, 694730)}
2017-05-13 14:09:02 [scrapy.core.engine] INFO: Spider closed (finished)

'use_real_when_empty': False

Hello, on Reposhub I found this setting mentioned for the project:
'use_real_when_empty': False,
Does it work? I haven't found such a function anywhere in the code.

Error

2018-07-26 10:26:02 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-07-26 10:26:02 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-07-26 10:26:02 [scrapy.proxies] DEBUG: Proxy user pass not found
2018-07-26 10:26:02 [scrapy.proxies] DEBUG: Using proxy https://185.93.3.70:8080, 1 proxies left
2018-07-26 10:26:03 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://piknu.com/u/isabel_sanzz/similar> (failed 1 times): 403 Forbidden
2018-07-26 10:26:03 [scrapy.proxies] DEBUG: Proxy user pass not found
2018-07-26 10:26:03 [scrapy.proxies] DEBUG: Using proxy https://185.93.3.70:8080, 1 proxies left

There is a problem when the character "@" appears in the password

There is a problem when the character "@" appears in the password; maybe we should make the regex pattern more compatible? :) Here is my solution:

parts = re.match('(\w+://)([^:]+?:.+@)?(.+)', line.strip())

instead of

parts = re.match('(\w+://)([^:]+?:[^@]+?@)?(.+)', line.strip())
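A quick check of the two patterns on a credential containing "@" (the proxy value below is made up) shows the difference:

    import re

    line = 'http://user:p@ssword@203.0.113.10:8080'  # made-up entry with '@' in the password

    # Original pattern: the credential part stops at the first '@',
    # so the password is split in the wrong place
    old = re.match(r'(\w+://)([^:]+?:[^@]+?@)?(.+)', line)
    # Proposed pattern: '.+' is greedy, so the credential part runs up to
    # the last '@' before host:port
    new = re.match(r'(\w+://)([^:]+?:.+@)?(.+)', line)

    print(old.group(3))  # 'ssword@203.0.113.10:8080' -- wrong host part
    print(new.group(3))  # '203.0.113.10:8080'        -- correct host part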

When the file has an empty line this code raises an error

When the file has an empty line, this code raises an error:
if parts.group(2): raises AttributeError: 'NoneType' object has no attribute 'group'
So, before calling group() we can check parts first, like this:

if parts:
    if parts.group(2):
    ...
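Equivalently, blank (and otherwise unparseable) lines can be skipped before group() is ever called; a minimal sketch of loading the list this way (the path is illustrative):

    import re

    with open('/path/to/proxy/list.txt') as fin:
        for line in fin:
            line = line.strip()
            if not line:
                continue  # skip empty lines so re.match is never given them
            parts = re.match(r'(\w+://)([^:]+?:.+@)?(.+)', line)
            if parts is None:
                continue  # also skip lines that are not in proxy format
            # ... build the proxy entry from parts.group(1), (2) and (3)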

TypeError: memoryview: a bytes-like object is required, not 'str'

Hi, I am getting the error below when using the DOWNLOADER_MIDDLEWARES indicated in the README (I added a proxy list, etc.). I read a bunch of threads on SO but couldn't fix my issue.

Appreciate any help
thanks

Traceback (most recent call last):
File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/base64.py", line 517, in _input_type_check
m = memoryview(s)
TypeError: memoryview: a bytes-like object is required, not 'str'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
result = g.send(result)
File "/usr/local/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 37, in process_request
response = yield method(request=request, spider=spider)
File "/usr/local/lib/python3.6/site-packages/scrapy_proxies/randomproxy.py", line 70, in process_request
basic_auth = 'Basic ' + base64.encodestring(proxy_user_pass)
File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/base64.py", line 547, in encodestring
return encodebytes(s)
File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/base64.py", line 534, in encodebytes
_input_type_check(s)
File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/base64.py", line 520, in _input_type_check
raise TypeError(msg) from err
TypeError: expected bytes-like object, not str
2017-10-16 23:19:20 [scrapy.core.engine] INFO: Closing spider (finished)
2017-10-16 23:19:20 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/builtins.TypeError': 1,
'finish_reason': 'finished
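The traceback shows base64.encodestring() being called with a str on Python 3, where it only accepts bytes (it was deprecated and later removed). The bytes-safe form, which the newer code quoted in other issues already uses, is:

    import base64

    proxy_user_pass = 'user:password'  # illustrative credentials
    # encode() the str to bytes for b64encode, then decode() the result back to str
    basic_auth = 'Basic ' + base64.b64encode(proxy_user_pass.encode()).decode()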

Not working with scrapy-splash

I'm using scrapy-splash to crawl an AJAX site, and when using scrapy-proxies it seems that the request is not sent through the proxy; the proxy is not used at all.

Proxy list can't be loaded on Scrapy Cloud

I tried several different ways.

I added the file "proxylist.txt" to the same folder as the project's settings; in addition I uploaded it to "https://dl.dropboxusercontent.com/s/esdm19mnvz2yguf/proxylist.txt".

I substituted the name in the setting:
PROXY_LIST = 'https://dl.dropboxusercontent.com/s/esdm19mnvz2yguf/proxylist.txt'
or
PROXY_LIST = 'proxylist.txt'
or
PROXY_LIST = '/proxylist.txt'
PROXY_LIST = '../proxylist.txt'

If I use PROXY_LIST = 'proxylist.txt' on my PC it works like a charm, but not once I load it into Scrapy Cloud.

Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1299, in _inlineCallbacks
result = g.send(result)
File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 90, in crawl
six.reraise(*exc_info)
File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 72, in crawl
self.engine = self._create_engine()
File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 97, in _create_engine
return ExecutionEngine(self, lambda _: self.stop())
File "/usr/local/lib/python2.7/site-packages/scrapy/core/engine.py", line 69, in init
self.downloader = downloader_cls(crawler)
File "/usr/local/lib/python2.7/site-packages/scrapy/core/downloader/init.py", line 88, in init
self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
File "/usr/local/lib/python2.7/site-packages/scrapy/middleware.py", line 58, in from_crawler
return cls.from_settings(crawler.settings, crawler)
File "/usr/local/lib/python2.7/site-packages/scrapy/middleware.py", line 36, in from_settings
mw = mwcls.from_crawler(crawler)
File "/app/python/lib/python2.7/site-packages/scrapy_proxies/randomproxy.py", line 55, in from_crawler
return cls(crawler.settings)
File "/app/python/lib/python2.7/site-packages/scrapy_proxies/randomproxy.py", line 35, in init
fin = open(self.proxy_list)
IOError: [Errno 2] No such file or directory: '../proxylist.txt'

Please, I need some help.

Proxy file name as an argument

Can I pass the name of the proxy file as a variable to Scrapy?
That way, if I'm running multiple crawlers at the same time, I would be able to use a different list of proxies for each.

Thank you
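One possible approach, assuming the middleware reads PROXY_LIST from the crawler settings as in the settings.py example above: Scrapy's -s command-line option overrides a setting for a single run, so each crawler can be pointed at its own file (paths and spider names below are illustrative):

    scrapy crawl first_spider -s PROXY_LIST=/path/to/first_list.txt
    scrapy crawl second_spider -s PROXY_LIST=/path/to/second_list.txt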

Properly formatting ProxyList.txt file

Hey there,

Just looking for some basic info. I'm trying to figure out how to properly build my ProxyList.txt file. I've got the IP addresses from HMA Pro, but I'm not sure how to locate the port that goes at the end. I've tried searching Google for how to find the ports but am still not sure. Is there another free service I could use to find the information I need (IP address and port)?

Thanks a ton

I am confused about the non-username-password proxy logic

if proxy_user_pass:
    request.meta['proxy'] = proxy_address
    basic_auth = 'Basic ' + base64.b64encode(proxy_user_pass.encode()).decode()
    request.headers['Proxy-Authorization'] = basic_auth
else:
    log.debug('Proxy user pass not found')
log.debug('Using proxy <%s>, %d proxies left' % (
    proxy_address, len(self.proxies)))

I am very confused here as a noob Python developer. From this part of the logic in the randomproxy file, it seems like the proxy is assigned to the request only if the entry in list.txt is in the http://username:password@host2:port format; otherwise it does nothing but log a debug message...

What am I missing here?

Response never received errors

Does anyone else experience timeout errors, specifically immediately after redirects?

I've only set this up today, but specifically https://www.game.co.uk/en/hardware/xbox-series-x/?contentOnly=&inStockOnly=true&listerOnly=&pageSize=100

I can fetch it OK with scrapy fetch, but if I try to use a spider that crawls the URL, I hit a 302 redirect and my crawl completely errors out from that point with immediate "response never received" failures. It's not long timeouts; it's literally erroring immediately.

Please could somebody help me? I'm fairly new to this and I have no idea what the cause may be.

I'm using a pool of 10 HTTP proxies on port 80.

How to check failure manually?

Is there any way to check for failure through something other than the HTTP status code?
Maybe based on the response body, headers, or something else?

Getting proxy and scraping in one Scrapy project

Hi, Aivars!
I use your random proxy middleware for Scrapy, scrapy_proxies. It works fine, thank you a lot!

At first, I get list.txt (the list of proxies) by scraping a free-proxy site (without proxy rotating).
Then I scrape another site (with scrapy_proxies).
When I run these as two different Scrapy projects, it works well.

I tried to run them together in one Scrapy project, but unfortunately it doesn't work, probably because in that case it tries to use list.txt for proxy rotating while it is still empty at the moment of the request to the free-proxy site.
Is there a way around this?

Thank you

Verify slow crawling

I gave it a list of about 300 proxies and set CONCURRENT_REQUESTS = 64. Still, crawling seems very slow (about 1 page every few seconds on average), much slower than not using any proxy at all. Of course DOWNLOAD_DELAY is low.

Looking into it, it seems that people should usually also increase CONCURRENT_REQUESTS_PER_DOMAIN in these cases (i.e. with a list of many possibly bad proxies), but even then it's still pretty slow.

Change proxy on http code 429 and dont die

Hi, is it possible to change the proxy on HTTP code 429?

If I get a 429 error, I want to change to another proxy from the list.

So I want to run PROXY_MODE = 1, but if I get a 429, I want to switch to a new proxy.
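One possible starting point (just a sketch; whether the proxy actually rotates on the retry depends on how the middleware handles request.meta, as other issues here note) is to add 429 to RETRY_HTTP_CODES in settings.py so those responses are retried at all:

    RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408, 429]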

How to force remove proxy ??

This is not an issue.
Sometimes, when we open a URL we can get an HTTP 200 response,
but it gives 0 results... or, I could say, I got banned from that website.

Is there any way to force remove the proxy item?

Thank you! Any help accepted :)

Dynamic Proxy restart

Is it possible to restart the proxy list if it gets to 0? I have dynamic proxies that refresh every 15 minutes, so I want Scrapy to reload the list if len(self.proxies) == 0.

Thanks!

Passing a proxy via meta in start_requests throws KeyError: 'exception'

I see this is caused by line 83:

if 'proxy' in request.meta:
    if request.meta["exception"] is False:
        return

If we have used a proxy in the start_requests function, then this issue arises, which makes sense because 'exception' is not defined in meta at that point for our first request.

I guess most of us either use a random proxy or a custom proxy, so no one ever bothered about it.
I think line 83 is important because it enables changing proxies on each retry or after an exception.

def start_requests(self):
    yield scrapy.Request('http://quotes.toscrape.com/', callback=self.parse, meta={'proxy': 'http://xxxx:xxxx@xxxx:xxxx'})

Also, to change the proxy on retry, comment this out in process_exception (see #15):

       if 'proxy' not in request.meta:
             return
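A possible workaround without touching the middleware, based on the check quoted above (this is just a sketch, not an official fix): pre-set the 'exception' key whenever a proxy is passed through meta in start_requests, so the lookup never raises:

    def start_requests(self):
        # 'exception': False satisfies the middleware's check and keeps the
        # manually supplied proxy for this first request
        yield scrapy.Request(
            'http://quotes.toscrape.com/',
            callback=self.parse,
            meta={'proxy': 'http://xxxx:xxxx@xxxx:xxxx', 'exception': False},
        )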
