Comments (10)
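I followed this reference I found online:
https://www.kebook.cn/9329/
and added this line of code:
browser.execute_script("Object.defineProperties(navigator,{webdriver:{get:() => false}})")
It still cannot scrape anything: when the program runs, it is redirected to the login page. The article then suggests using a Fiddler proxy to replace the JS, which I don't know how to do. I hope the author can rewrite this code so that I can learn from it.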
from taobaoproduct.
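I ran into the same problem. The cause may be that Taobao detected Selenium's WebDriver and triggered its anti-scraping measures. I hope someone can share how to scrape Taobao product data.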
from taobaoproduct.
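Taobao mainly relies on the value of window.navigator.webdriver for detection, and the other navigator properties are not involved, so it is enough to inject JS with execute_script and set window.navigator.webdriver to false.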
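One way to apply the suggestion above can be sketched as follows. This is a minimal sketch, not a confirmed fix: it assumes a Chromium-based Selenium driver that exposes execute_cdp_cmd, and the names HIDE_WEBDRIVER_JS and hide_webdriver_flag are made up for illustration. Since execute_script only runs after a page has already loaded, the sketch registers the override through the DevTools command Page.addScriptToEvaluateOnNewDocument so that it runs before the page's own scripts. Whether this is enough to avoid Taobao's login redirect is not verified here.

```python
# JavaScript that redefines navigator.webdriver so page scripts read false.
HIDE_WEBDRIVER_JS = (
    "Object.defineProperty(navigator, 'webdriver', {get: () => false})"
)


def hide_webdriver_flag(browser):
    """Register the override for every page the browser opens afterwards.

    `browser` is assumed to be a Chromium-based Selenium driver, i.e. an
    object with an execute_cdp_cmd(cmd, cmd_args) method.
    """
    browser.execute_cdp_cmd(
        'Page.addScriptToEvaluateOnNewDocument',
        {'source': HIDE_WEBDRIVER_JS},
    )
```

In the script posted in this thread, the call would go right after browser = webdriver.Chrome(...) and before the first browser.get(url), for example hide_webdriver_flag(browser).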
from taobaoproduct.
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from urllib.parse import quote
from pyquery import PyQuery as pq
import pymongo

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
# Selenium 4 takes the options object via options=; chrome_options= was removed.
browser = webdriver.Chrome(options=chrome_options)
wait = WebDriverWait(browser, 10)
KEYWORD = 'ipad'


def index_page(page):
    """
    Crawl an index page.
    :param page: page number
    """
    print('Crawling page', page)
    try:
        url = 'https://s.taobao.com/search?q=' + quote(KEYWORD)
        browser.get(url)
        if page > 1:
            # Jump to the requested page through the pager's input box.
            page_input = wait.until(
                EC.presence_of_element_located((By.CSS_SELECTOR, '#mainsrp-pager div.form > input')))
            submit = wait.until(
                EC.element_to_be_clickable((By.CSS_SELECTOR, '#mainsrp-pager div.form > span.btn.J_Submit')))
            page_input.clear()
            page_input.send_keys(page)
            submit.click()
        wait.until(
            EC.text_to_be_present_in_element((By.CSS_SELECTOR, '#mainsrp-pager li.item.active > span'), str(page)))
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.m-itemlist .items .item')))
        get_products()
    except TimeoutException:
        # Retries the same page without a limit, as in the original script.
        index_page(page)


def get_products():
    """
    Extract product data.
    """
    html = browser.page_source
    doc = pq(html)
    # The list container id is "mainsrp-itemlist"; price and location are classes.
    items = doc('#mainsrp-itemlist .items .item').items()
    for item in items:
        product = {
            'image': item.find('.pic .img').attr('data-src'),
            'price': item.find('.price').text(),
            'deal': item.find('.deal-cnt').text(),
            'title': item.find('.title').text(),
            'shop': item.find('.shop').text(),
            'location': item.find('.location').text()
        }
        print(product)
        save_to_mongo(product)


MONGO_URL = 'localhost'
MONGO_DB = 'taobao'
MONGO_COLLECTION = 'products'
client = pymongo.MongoClient(MONGO_URL)
db = client[MONGO_DB]


def save_to_mongo(result):
    """
    Save to MongoDB.
    :param result: the record to store
    """
    try:
        # insert_one() replaces Collection.insert(), which PyMongo 4 removed.
        if db[MONGO_COLLECTION].insert_one(result):
            print('Saved to MongoDB')
    except Exception:
        print('Failed to save to MongoDB')


MAX_PAGE = 2


def main():
    """
    Iterate over every page.
    """
    for i in range(1, MAX_PAGE + 1):
        index_page(i)


if __name__ == '__main__':
    main()
I don't know why lines 34 and 35 of the source code keep raising the error above.
from taobaoproduct.
I read the code and the lookup conditions are correct. Try not adding a line break after the > on line 34.
from taobaoproduct.
This is a serious problem for newcomers. Could the author fix this issue?
from taobaoproduct.
The scraper stops at the login page. What should I do?
from taobaoproduct.
The scraper stops at the login page. What should I do?
+1
from taobaoproduct.
https://mp.weixin.qq.com/s/Iz-DY1UrSfVFRFh5CyHl3Q is also worth a look.
from taobaoproduct.