scrapy_ipproxypool's People

Contributors

monkey-soft


scrapy_ipproxypool's Issues

Unable to connect because the target machine actively refused it

As the title says. First, two log screenshots:

[log screenshot: tim 20181009114907]
[log screenshot: tim 20181009114923]

Below is the complete settings.py configuration. I am not sure whether the problem is in my configuration or somewhere else. Whenever it runs, it first crawls some proxy IPs and stores them in the database, and then automatically moves on to my own spider.


```python
BOT_NAME = 'CommoditySpider'

SPIDER_MODULES = ['CommoditySpider.spiders']
NEWSPIDER_MODULE = 'CommoditySpider.spiders'


USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 Safari/537.36'

ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 2

COOKIES_ENABLED = False

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Content-Type': 'text/html;charset=UTF-8',
    'Cache-Control': 'no-cache',
}

ITEM_PIPELINES = {
    'CommoditySpider.aliexpresslines.pipelines.AliExpressPipeline': 300
}

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 3
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

DOWNLOADER_MIDDLEWARES = {
    # Entry format:
    #   yourproject.myMiddlewares(module name).MiddlewareClass

    # Set the User-Agent
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400,
}

# Use the IP proxy pool by default
if IF_USE_PROXY:
    DOWNLOADER_MIDDLEWARES = {

        # Entry format:
        #   yourproject.myMiddlewares(module name).MiddlewareClass

        # Set the User-Agent
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
        'proxyPool.scrapy.RandomUserAgentMiddleware.RandomUserAgentMiddleware': 400,

        # Configure the proxy
        'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': None,
        'proxyPool.scrapy.middlewares.ProxyMiddleware': 100,

        # Custom exception-catching middleware
        'proxyPool.scrapy.middlewares.CatchExceptionMiddleware': 105,

        # Custom retry middleware
        'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': None,
        'proxyPool.scrapy.middlewares.RetryMiddleware': 95,
    }
```
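
One likely culprit in this configuration: the `scrapy.contrib.downloadermiddleware.*` paths were deprecated when Scrapy 1.0 renamed them to `scrapy.downloadermiddlewares.*`, and the old aliases were removed in later releases, so on a recent Scrapy these entries may fail to import at all. A minimal sketch of the same middleware setup under the current module layout (the `proxyPool.*` entries are this project's own classes and are kept unchanged):

```python
# Same configuration as above, using the post-1.0 Scrapy module paths.
DOWNLOADER_MIDDLEWARES = {
    # Disable the built-in middlewares that the proxy pool replaces.
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,

    # Proxy-pool replacements (project classes, unchanged).
    'proxyPool.scrapy.RandomUserAgentMiddleware.RandomUserAgentMiddleware': 400,
    'proxyPool.scrapy.middlewares.ProxyMiddleware': 100,
    'proxyPool.scrapy.middlewares.CatchExceptionMiddleware': 105,
    'proxyPool.scrapy.middlewares.RetryMiddleware': 95,
}
```

As for the error in the title: on Windows, "the target machine actively refused it" typically means the port being connected to is closed, which with a proxy pool usually indicates that the randomly chosen free proxy has already gone dead.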

A question about the file proxyDBManger.py

proxy = str(data[1], encoding="utf-8").lower() + "://" + str(data[0], encoding="utf-8") + ":" + str(data[2])

This statement raises:

TypeError: 'NoneType' object is not subscriptable

What causes this, and how should it be fixed?
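
The `TypeError` means `data` is `None`: the expression indexes the result of a database fetch (typically `cursor.fetchone()`) that found no rows, which happens when the proxy table is still empty. A minimal guard, sketched under the assumption of a pymysql cursor and the column order implied by the expression above:

```python
# Hypothetical excerpt: fetch one proxy row and build the proxy URL.
row = cursor.fetchone()  # returns None when the proxy table is empty
if row is None:
    proxy = None  # no proxy available yet; the caller must handle this
else:
    # Column order taken from the original expression: (ip, scheme, port).
    scheme = str(row[1], encoding="utf-8").lower()  # e.g. "http"
    ip = str(row[0], encoding="utf-8")
    port = str(row[2])
    proxy = scheme + "://" + ip + ":" + port
```

So the fix is twofold: make sure the proxy-crawling step has actually populated the table before the spider starts, and guard against an empty result instead of indexing it blindly.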

I am a beginner

I don't understand why your code is so different from the files generated when I create a new Scrapy project.
[screenshot]
Did you rewrite everything yourself? Even the engine?
Most importantly, when I run it the way you describe, I keep getting errors.
[screenshot]

Working with Scrapy

Hello, I have rewritten your project, but I ran into some problems getting it to work with Scrapy. Is there any way to get in touch with you?

Something went wrong

The proxies I get cannot crawl the site. I want to crawl wandoujia, but requests through the fetched IPs time out.

/Users/icst/Desktop/test_proxy/wandoujia/proxyPool/ProxyPoolWorker.py:81: SyntaxWarning: "is not" with a literal. Did you mean "!="?
if proxy is not '':
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymysql/cursors.py:170: Warning: (1681, b'Integer display width is deprecated and will be removed in a future release.')
result = self._query(query)
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymysql/cursors.py:170: Warning: (3719, b"'utf8' is currently an alias for the character set UTF8MB3, but will be an alias for UTF8MB4 in a future release. Please consider using UTF8MB4 in order to be unambiguous.")
result = self._query(query)
Crawling Kuaidaili…
115.216.56.92 | 9999 | High anonymity | HTTP | Hangzhou, Zhejiang (China Telecom) | 3s
123.149.136.127 | 9999 | High anonymity | HTTP | Luoyang, Henan (China Telecom) | 1s
111.72.25.153 | 9999 | High anonymity | HTTP | Fuzhou, Jiangxi (China Telecom) | 0.5s
183.166.111.11 | 9999 | High anonymity | HTTP | Huainan, Anhui (China Telecom) | 2s
171.35.211.234 | 9999 | High anonymity | HTTP | Xinyu, Jiangxi (China Unicom) | 3s
114.239.110.93 | 9999 | High anonymity | HTTP | Suqian, Jiangsu (China Telecom) | 2s
110.243.2.58 | 9999 | High anonymity | HTTP | Tangshan, Hebei (China Unicom) | 2s
114.99.22.104 | 9999 | High anonymity | HTTP | Tongling, Anhui (China Telecom) | 2s
124.113.250.171 | 9999 | High anonymity | HTTP | Suzhou, Anhui (China Telecom) | 3s
123.149.141.209 | 9999 | High anonymity | HTTP | Luoyang, Henan (China Telecom) | 1s
183.146.156.254 | 9999 | High anonymity | HTTP | Jinhua, Zhejiang (China Telecom) | 0.7s
123.149.136.121 | 9999 | High anonymity | HTTP | Luoyang, Henan (China Telecom) | 3s
163.204.247.139 | 9999 | High anonymity | HTTP | Shanwei, Guangdong (China Unicom) | 1s
123.163.27.220 | 9999 | High anonymity | HTTP | Luoyang, Henan (China Telecom) | 0.8s
1.196.177.218 | 9999 | High anonymity | HTTP | Luoyang, Henan (China Telecom) | 0.7s
2020-02-09 23:15:11 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: wandoujia)
2020-02-09 23:15:11 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.8.1 (v3.8.1:1b293b6006, Dec 18 2019, 14:08:53) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d 10 Sep 2019), cryptography 2.8, Platform macOS-10.14.1-x86_64-i386-64bit
2020-02-09 23:15:11 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'wandoujia', 'COOKIES_ENABLED': False, 'NEWSPIDER_MODULE': 'wandoujia.spiders', 'SPIDER_MODULES': ['wandoujia.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
2020-02-09 23:15:11 [scrapy.extensions.telnet] INFO: Telnet Password: 79f3a3cb43e725d1
2020-02-09 23:15:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2020-02-09 23:15:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['proxyPool.scrapy.middlewares.RetryMiddleware',
'proxyPool.scrapy.middlewares.ProxyMiddleware',
'proxyPool.scrapy.middlewares.CatchExceptionMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'proxyPool.scrapy.RandomUserAgentMiddleware.RandomUserAgentMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'wandoujia.middlewares.WandoujiaDownloaderMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-02-09 23:15:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-02-09 23:15:11 [scrapy.middleware] INFO: Enabled item pipelines:
['wandoujia.pipelines.MyFilesPipeline']
2020-02-09 23:15:11 [scrapy.core.engine] INFO: Spider opened
2020-02-09 23:15:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-02-09 23:15:11 [main] INFO: Spider opened: main
2020-02-09 23:15:11 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-02-09 23:15:11 [root] DEBUG: ===== ProxyMiddleware get a random_proxy:【 http://123.149.136.121:9999 】 =====
2020-02-09 23:16:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-02-09 23:17:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-02-09 23:18:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-02-09 23:18:11 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.wandoujia.com/apps/665777> (failed 1 times): User timeout caused connection failure: Getting https://www.wandoujia.com/apps/665777 took longer than 180.0 seconds..
2020-02-09 23:18:11 [root] DEBUG: ===== ProxyMiddleware get a random_proxy:【 http://110.243.2.58:9999 】 =====
2020-02-09 23:19:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-02-09 23:19:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.wandoujia.com/apps/665777> (failed 2 times): TCP connection timed out: 60: Operation timed out.
2020-02-09 23:19:27 [root] DEBUG: ===== ProxyMiddleware get a random_proxy:【 http://1.196.177.218:9999 】 =====
2020-02-09 23:19:27 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www.wandoujia.com/apps/665777> (failed 3 times): Connection was refused by other side: 61: Connection refused.
2020-02-09 23:19:27 [root] DEBUG: === success to update 1.196.177.218 proxy ===
2020-02-09 23:19:27 [root] DEBUG: === success to update 1.196.177.218 proxy ===
2020-02-09 23:19:27 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.wandoujia.com/apps/665777>
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
defer.returnValue((yield download_func(request=request, spider=spider)))
twisted.internet.error.ConnectionRefusedError: Connection was refused by other side: 61: Connection refused.
2020-02-09 23:19:27 [scrapy.core.engine] INFO: Closing spider (finished)
2020-02-09 23:19:27 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
'downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError': 1,
'downloader/exception_type_count/twisted.internet.error.TCPTimedOutError': 1,
'downloader/exception_type_count/twisted.internet.error.TimeoutError': 1,
'downloader/request_bytes': 918,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'elapsed_time_seconds': 256.041098,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 2, 9, 15, 19, 27, 373921),
'log_count/DEBUG': 8,
'log_count/ERROR': 1,
'log_count/INFO': 15,
'memusage/max': 67170304,
'memusage/startup': 66805760,
'retry/count': 2,
'retry/max_reached': 1,
'retry/reason_count/twisted.internet.error.TCPTimedOutError': 1,
'retry/reason_count/twisted.internet.error.TimeoutError': 1,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2020, 2, 9, 15, 15, 11, 332823)}
2020-02-09 23:19:27 [scrapy.core.engine] INFO: Spider closed (finished)
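
Free proxies scraped from sites like Kuaidaili are often dead or banned within minutes, which matches this log: all three attempts through pool proxies end in timeouts or a refused connection. A common mitigation is to verify each candidate proxy against the target site before trusting it. A minimal sketch using `requests` (the `check_proxy` helper and the candidate list are illustrative, not part of this project):

```python
import requests

def check_proxy(proxy_url, test_url="https://www.wandoujia.com", timeout=5):
    """Return True if proxy_url can fetch test_url within `timeout` seconds."""
    proxies = {"http": proxy_url, "https": proxy_url}
    try:
        resp = requests.get(test_url, proxies=proxies, timeout=timeout)
        return resp.status_code == 200
    except requests.RequestException:
        return False

# Two candidates taken from the log above.
candidates = ["http://123.149.136.121:9999", "http://110.243.2.58:9999"]
live = [p for p in candidates if check_proxy(p)]
print(live)  # most free proxies will fail this check
```

Separately, the `SyntaxWarning` at the top of the log is fixed exactly as Python suggests: replace `if proxy is not '':` in ProxyPoolWorker.py with `if proxy != '':`.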
