k1995 / baiduyunspider

A search engine for Baidu Cloud netdisk shares, including the crawler and the website

Python 3.04% HTML 0.67% JavaScript 96.29%

python spider

baiduyunspider's Issues

Crawler reports an error when using a proxy IP

I modified the code to use a proxy IP, but it still reports an error:

uk:2518160999 error to fetch files,try again later

getShareLists errno:-55

The code is as follows:

import sys
import time
import urllib2

def getHtml(url, ref=None, reget=5):
    try:
        proxies = {'http': '222.194.14.130:808'}
        proxy_support = urllib2.ProxyHandler(proxies)
        # Define the opener that routes requests through the proxy
        opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
        request = urllib2.Request(url)
        request.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36')
        if ref:
            request.add_header('Referer', ref)
        # Call opener.open: urllib2.urlopen ignores this opener (and so the
        # proxy) unless urllib2.install_opener(opener) has been called first
        page = opener.open(request, timeout=10)
        html = page.read()
    except Exception:
        if reget >= 1:
            # If getHtml fails, retry up to 5 more times
            print 'getHtml error,reget...%d' % (6 - reget)
            time.sleep(2)
            return getHtml(url, ref, reget - 1)
        else:
            print 'request url:' + url
            print 'failed to fetch html'
            sys.exit(1)
    else:
        return html
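
For readers on Python 3, the same idea (send requests through a proxy, retry a bounded number of times) can be sketched as below. This is only an illustration under assumptions: build_proxy_opener and fetch_with_retries are hypothetical helper names, not part of baiduyunspider, and the proxy URL form with an explicit http:// scheme is assumed.

```python
import time
import urllib.request


def build_proxy_opener(proxy_url):
    """Return an opener that sends http and https requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_with_retries(fetch, attempts=5, delay=2.0):
    """Call fetch() until it succeeds or `attempts` calls have failed."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            last_error = exc
            if attempt < attempts:
                time.sleep(delay)
    raise RuntimeError("failed after %d attempts" % attempts) from last_error
```

A call might look like fetch_with_retries(lambda: opener.open(url, timeout=10).read()), where opener comes from build_proxy_opener("http://222.194.14.130:808"). A loop replaces the recursion, and the final failure is raised instead of exiting the process, so the caller decides what to do next.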

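1 errno=-55: what causes this? My crawler keeps getting this error code back

errno=-55: what causes this? My crawler keeps getting this error code back. Could you give me a roughly commented version of the crawler script that I can modify myself? I'd like to avoid some detours, as I'm a Python beginner. Thanks.

How to crawl by a specified keyword

Hello, I'd like to crawl data for a specified keyword. How should I do that?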

11

bug

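Using this as a reference, I also wrote a Baidu Cloud search: www.81ad.cn

I also wrote a Baidu Cloud search, www.81ad.cn. It carries no ads, calls Baidu's internal API, and already holds more than 30 million records.

Internal page optimization

Could you add more site pages, or a related-keyword search, to improve the linking between internal pages?

Is there no source code for the search engine part?

I'd like to understand how this search engine works.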

Crawler reports an error when seeding

success to fetched hot users: 24
Traceback (most recent call last):
  File "spider.py", line 475, in <module>
    spider.seedUsers()
  File "spider.py", line 328, in seedUsers
    self.db.commit()
  File "spider.py", line 101, in commit
    self.dbconn.commit()
AttributeError: 'NoneType' object has no attribute 'commit'

Is there any way to fix this?
The operating system is CentOS 7 x64.
The Python version is 2.7.5.
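
The traceback shows that self.dbconn was still None when commit() ran, which usually means the database connection was never established or failed earlier. The spider.py wrapper class is not shown in this thread, so the following is only a minimal sketch of a guard that turns the confusing AttributeError into an explicit message. The class name Db, its connect() method, and the use of sqlite3 as a stand-in driver are all assumptions for illustration.

```python
import sqlite3


class Db:
    """Hypothetical stand-in for the database wrapper used by the spider."""

    def __init__(self):
        self.dbconn = None  # stays None until connect() succeeds

    def connect(self, path=":memory:"):
        # sqlite3 is used only to keep the sketch self-contained
        self.dbconn = sqlite3.connect(path)

    def commit(self):
        if self.dbconn is None:
            raise RuntimeError(
                "database connection was never established; "
                "check the connection settings and that the server is running"
            )
        self.dbconn.commit()
```

With a guard like this, a failed or skipped connect() shows up as a clear error at the first commit() rather than as a NoneType attribute error.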

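How can I contact you

How do I get in touch with you?

Impressive~

Writing a crawler as a college senior... please accept my respect~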

redis.exceptions.ConnectionError: Error 10061 connecting to 127.0.0.1:6379.

I followed your steps. Is something missing?
Running scrapy crawl baidupan keeps reporting this error.
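
Error 10061 is the Windows "connection refused" code, meaning nothing accepted the connection on 127.0.0.1:6379 when the crawl started. A quick way to confirm that before launching the crawl is a plain TCP check like the sketch below. The helper name port_is_open is hypothetical, and the check only tells you whether something is listening on the port, not that it is actually Redis.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers connection refused and timeouts
        return False
```

For example, port_is_open("127.0.0.1", 6379) returning False suggests the Redis server is not running or is listening on a different address or port.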

404 Not Found

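When I send a search request, the request URL looks like this:
http://mydomain/s/57un55S15L%2Bd5oqk?from=sf&type=all
But nginx returns a 404 error every time, saying the page cannot be found.
This problem has been bothering me for several days and I still haven't solved it.
Could someone more experienced please help?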
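
When debugging the routing, it can help to know what the path segment carries. The last segment of the URL above appears to be percent-encoded Base64 of the search keyword (%2B is an encoded "+"). The sketch below decodes it. The function name decode_search_segment is hypothetical, and that the site encodes keywords this way is an inference from this one URL, not something the thread confirms.

```python
import base64
from urllib.parse import unquote, urlsplit


def decode_search_segment(url):
    """Decode the last path segment of url as percent-encoded Base64 UTF-8."""
    path = urlsplit(url).path             # e.g. "/s/57un55S15L%2Bd5oqk"
    segment = path.rsplit("/", 1)[-1]     # last path component
    raw = unquote(segment)                # "%2B" becomes "+"
    raw += "=" * (-len(raw) % 4)          # restore padding if it was stripped
    return base64.b64decode(raw).decode("utf-8")
```

Because such a segment can contain "+" and "/", whatever rewrite or location rule forwards /s/... to the application has to pass the segment through unchanged.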

Which version of Python are you using?

Running scrapy crawl baidupan reports an error. How can I fix it?

  File "c:\users\administrator.win-a3unjobi233\appdata\local\programs\python\python38\lib\site-packages\scrapy\crawler.py", line 89, in crawl
    yield self.engine.open_spider(self.spider, start_requests)
redis.exceptions.ConnectionError: Error 10061 connecting to 127.0.0.1:6379. No connection could be made because the target machine actively refused it.

2021-02-01 10:12:28 [twisted] CRITICAL:
Traceback (most recent call last):
  File "c:\users\administrator.win-a3unjobi233\appdata\local\programs\python\python38\lib\site-packages\redis\connection.py", line 559, in connect
    sock = self._connect()
  File "c:\users\administrator.win-a3unjobi233\appdata\local\programs\python\python38\lib\site-packages\redis\connection.py", line 615, in _connect
    raise err
  File "c:\users\administrator.win-a3unjobi233\appdata\local\programs\python\python38\lib\site-packages\redis\connection.py", line 603, in _connect
    sock.connect(socket_address)
ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\users\administrator.win-a3unjobi233\appdata\local\programs\python\python38\lib\site-packages\twisted\internet\defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "c:\users\administrator.win-a3unjobi233\appdata\local\programs\python\python38\lib\site-packages\scrapy\crawler.py", line 89, in crawl
    yield self.engine.open_spider(self.spider, start_requests)
redis.exceptions.ConnectionError: Error 10061 connecting to 127.0.0.1:6379. No connection could be made because the target machine actively refused it.