
Comments (10)

lllllllai27 avatar lllllllai27 commented on August 14, 2024 3

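Taobao mainly detects automation through the value of the window.navigator.webdriver property and does not rely on any other navigator properties, so it is enough to use execute_script to inject JS that sets window.navigator.webdriver to false.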

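I followed some material I found online:
https://www.kebook.cn/9329/
and added this line of code:
browser.execute_script("Object.defineProperties(navigator,{webdriver:{get:() => false}})")
It still cannot crawl: when the program runs, the browser is redirected to the login page. The suggested next step is to use Fiddler as a proxy to replace the JS, but I don't know how to do that. I hope the author can rewrite this code so I can learn from it.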
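
One possible reason the execute_script call has no effect is timing: it runs only after the page has loaded, by which point the site's own scripts may already have read navigator.webdriver, and an override injected before browser.get() is lost on navigation. A minimal sketch, assuming a Chromium-based Selenium driver (only those expose execute_cdp_cmd), registers the override through the DevTools command Page.addScriptToEvaluateOnNewDocument so that it runs before any page script. The helper name install_webdriver_override is hypothetical, and whether this alone avoids Taobao's login redirect is not confirmed anywhere in this thread.

```python
# JavaScript that makes navigator.webdriver report false.
OVERRIDE_JS = "Object.defineProperty(navigator, 'webdriver', {get: () => false})"


def install_webdriver_override(driver):
    """Register OVERRIDE_JS to run before page scripts on every new document.

    `driver` is assumed to be a Chromium-based Selenium driver, since only
    those provide execute_cdp_cmd.
    """
    return driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument",
        {"source": OVERRIDE_JS},
    )
```

The helper would be called once, right after the browser object is created and before the first browser.get(...).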

from taobaoproduct.

Jenkin7 avatar Jenkin7 commented on August 14, 2024 1

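I ran into the same problem. The cause is probably that Taobao detects Selenium's webdriver and then applies its anti-crawling measures. I hope someone can share how to crawl Taobao product data.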


Germey avatar Germey commented on August 14, 2024 1

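Taobao mainly detects automation through the value of the window.navigator.webdriver property and does not rely on any other navigator properties, so it is enough to use execute_script to inject JS that sets window.navigator.webdriver to false.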
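
To see whether an override like this actually took effect, the flag can be read back from the loaded page. The sketch below is an illustration, not the author's code: webdriver_flag and is_flag_hidden are hypothetical helpers, and driver is assumed to be any Selenium WebDriver with a page already loaded.

```python
def webdriver_flag(driver):
    """Return the current value of navigator.webdriver in the loaded page."""
    return driver.execute_script("return navigator.webdriver")


def is_flag_hidden(driver):
    """True when the page no longer reports automation via navigator.webdriver."""
    # A browser under WebDriver control reports True by default;
    # False or None (JavaScript undefined) means the flag is hidden.
    return not webdriver_flag(driver)
```

Calling is_flag_hidden(browser) right after browser.get(url) shows whether the page that was just loaded still sees the flag.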


pacluoluo avatar pacluoluo commented on August 14, 2024

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from urllib.parse import quote
from pyquery import PyQuery as pq
import pymongo

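# Run Chrome without a visible window.
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
# Current Selenium releases take `options=`; `chrome_options=` is the old keyword.
browser = webdriver.Chrome(options=chrome_options)
wait = WebDriverWait(browser, 10)
KEYWORD = 'ipad'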

def index_page(page):
    """
    Crawl one index (search result) page.
    :param page: page number
    """
    print('Crawling page', page)
    try:
        url = 'https://s.taobao.com/search?q=' + quote(KEYWORD)
        browser.get(url)
        if page > 1:
            # Type the page number into the pager form and submit it.
            page_input = wait.until(
                EC.presence_of_element_located((By.CSS_SELECTOR, '#mainsrp-pager div.form > input ')))
            submit = wait.until(
                EC.element_to_be_clickable((By.CSS_SELECTOR, '#mainsrp-pager div.form > span.btn.J_Submit')))
            page_input.clear()
            page_input.send_keys(page)
            submit.click()
        # Wait until the pager highlights the requested page and the items are present.
        wait.until(
            EC.text_to_be_present_in_element((By.CSS_SELECTOR, '#mainsrp-pager li.item.active > span'), str(page)))
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.m-itemlist .items .item')))
        get_products()
    except TimeoutException:
        # Retry the same page on timeout (no retry limit, as in the original).
        index_page(page)

def get_products():
    """
    Extract product data from the current page.
    """
    html = browser.page_source
    doc = pq(html)
    # '#mainsrp.itemlist' was presumably a typo for the '#mainsrp-itemlist' container.
    items = doc('#mainsrp-itemlist .items .item').items()
    for item in items:
        product = {
            'image': item.find('.pic .img').attr('data-src'),
            # Class selectors need a leading dot ('.price', '.location').
            'price': item.find('.price').text(),
            'deal': item.find('.deal-cnt').text(),
            'title': item.find('.title').text(),
            'shop': item.find('.shop').text(),
            'location': item.find('.location').text()
        }
        print(product)
        save_to_mongo(product)

MONGO_URL = 'localhost'
MONGO_DB = 'taobao'
MONGO_COLLECTION = 'products'
client = pymongo.MongoClient(MONGO_URL)
db = client[MONGO_DB]

def save_to_mongo(result):
    """
    Save one result to MongoDB.
    :param result: the product dict to store
    """
    try:
        # insert_one replaces the legacy Collection.insert method.
        if db[MONGO_COLLECTION].insert_one(result):
            print('Saved to MongoDB')
    except Exception:
        print('Failed to save to MongoDB')

MAX_PAGE = 2

def main():
    """
    Iterate over every page.
    """
    for i in range(1, MAX_PAGE + 1):
        index_page(i)

if __name__ == '__main__':
    main()

I don't know why lines 34 and 35 of the source keep raising the error above.


clamyang avatar clamyang commented on August 14, 2024

I looked at the code and the selectors are correct. On your line 34, try not putting a line break after the ">".


CHN2017 avatar CHN2017 commented on August 14, 2024

This is a terrible problem for newcomers. Could the author solve this issue?


wjx1018960145 avatar wjx1018960145 commented on August 14, 2024

The crawl gets stuck on the login page. What should I do?
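
Because the script posted above only waits for the result list, a redirect to the login page shows up as nothing more than a timeout. A small sketch for detecting the redirect explicitly follows. The helper is_login_redirect is hypothetical, and the assumption that the login page is served from a host starting with "login." (such as login.taobao.com) should be checked against the URL actually shown in the browser.

```python
from urllib.parse import urlparse


def is_login_redirect(url, login_prefix="login."):
    """Return True if `url` points at a login host rather than the search page.

    The "login." host prefix is an assumption, not something confirmed here.
    """
    host = urlparse(url).hostname or ""
    return host.startswith(login_prefix)
```

After browser.get(url), checking is_login_redirect(browser.current_url) lets the crawler stop with a clear message instead of retrying until it times out again.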


Hfywtias avatar Hfywtias commented on August 14, 2024

The crawl gets stuck on the login page. What should I do?

+1


wjx1018960145 avatar wjx1018960145 commented on August 14, 2024


Germey avatar Germey commented on August 14, 2024

https://mp.weixin.qq.com/s/Iz-DY1UrSfVFRFh5CyHl3Q can also be used as a reference.

