
taobaoproduct's Introduction

TaobaoProduct

Selenium Demo of Taobao Product

Update 2020/2/10: for the problem of login now being required, see Issue #15

taobaoproduct's People

Contributors

germey, yilouwangye



taobaoproduct's Issues

Logged in successfully via QR code, but after scraping 10 pages Taobao shows a slider verification

I logged in successfully by scanning the QR code, but after scraping 10 pages Taobao pops up a slider verification. Has anyone run into this problem, and how did you solve it?

[screenshot]

I dragged the slider on the page Selenium opened, but it had no effect; it just reported that verification failed.

[screenshot]

Is there a way to get around the slider check, e.g. pausing 5 s after each page, so as to avoid triggering Taobao's anti-bot mechanisms?
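The per-page pause suggested above can be sketched like this (a minimal, hypothetical wrapper; `index_page` stands for the page-scraping function from this repo, passed in as a callable):

```python
import time

PAGE_DELAY = 5  # seconds to wait between pages (tune as needed)

def crawl_all_pages(index_page, max_page, delay=PAGE_DELAY, sleep=time.sleep):
    """Scrape pages 1..max_page, pausing between requests to lower the request rate."""
    visited = []
    for page in range(1, max_page + 1):
        index_page(page)          # scrape one results page
        visited.append(page)
        if page < max_page:
            sleep(delay)          # idle before requesting the next page
    return visited
```

Note this only lowers the request rate; there is no guarantee it avoids the slider check.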

Can anyone explain what is going on here?

[13944:14208:0409/001345.567:ERROR:platform_sensor_reader_win.cc(243)] NOT IMPLEMENTED
[13944:14208:0409/001346.349:ERROR:platform_sensor_reader_win.cc(243)] NOT IMPLEMENTED
[13944:14208:0409/001346.967:ERROR:platform_sensor_reader_win.cc(243)] NOT IMPLEMENTED
[13944:14208:0409/001347.503:ERROR:platform_sensor_reader_win.cc(243)] NOT IMPLEMENTED
[13944:14208:0409/001348.176:ERROR:platform_sensor_reader_win.cc(243)] NOT IMPLEMENTED
[13944:14208:0409/001348.774:ERROR:platform_sensor_reader_win.cc(243)] NOT IMPLEMENTED
[13944:14208:0409/001349.372:ERROR:platform_sensor_reader_win.cc(243)] NOT IMPLEMENTED
[13944:14208:0409/001349.942:ERROR:platform_sensor_reader_win.cc(243)] NOT IMPLEMENTED
[13944:14208:0409/001350.609:ERROR:platform_sensor_reader_win.cc(243)] NOT IMPLEMENTED
[13944:14208:0409/001351.216:ERROR:platform_sensor_reader_win.cc(243)] NOT IMPLEMENTED


No module named 'config'

from config import *

ModuleNotFoundError: No module named 'config'

Which library is missing or badly installed here?
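`config` here is not a pip package but a local `config.py` that the script expects next to it. A minimal sketch, with the MongoDB values taken from the full script posted in this thread and the rest as placeholder assumptions:

```python
# config.py -- place in the same directory as the spider script
MONGO_URL = 'localhost'            # value from the full script in this thread
MONGO_DB = 'taobao'
MONGO_COLLECTION = 'products'
KEYWORD = 'iPad'                   # placeholder search keyword (assumption)
MAX_PAGE = 100                     # placeholder page count (assumption)
SERVICE_ARGS = ['--load-images=false']  # PhantomJS args; exact values are an assumption
```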

Could someone help me out: line 34 raises an error, and searching online I couldn't find the cause.

/usr/local/bin/python3 /Users/duoluoluolin/PycharmProjects/untitled/随便写写/动态渲染页面爬取/selenium_实战.py
Crawling page 1
Crawling page 1
Traceback (most recent call last):
File "/Users/duoluoluolin/PycharmProjects/untitled/随便写写/动态渲染页面爬取/selenium_实战.py", line 34, in index_page
EC.text_to_be_present_in_element((By.CSS_SELECTOR,'#mainsrp-pager li.item.active > span'),str(page)))
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/selenium/webdriver/support/wait.py", line 80, in until
raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:

The code is incomplete

As of 2020-11-29: the code has clearly been revised several times and is incomplete, so I would not recommend beginners study it as-is, but the scraping logic and approach inside are worth learning from.

Some other attempts that avoid login. Also, the page-number input box now stores the page in its value attribute, which has to be modified via JavaScript.

Approach 1: attach to an already-open browser

Open cmd and run chrome.exe --remote-debugging-port=9222 --user-data-dir="C:\selenum\AutomationProfile"
This starts Chrome with a remote-debugging port and a separate profile directory, leaving your normal profile untouched.

Use this newly opened browser to visit Taobao, log in, and search for a product.
Then run the code below (with chromedriver placed in the /Python/Script directory):

from selenium.webdriver.chrome.options import Options
from selenium import webdriver

# Initialize the browser options object
chrome_options = Options()
# Attach to the already-running Chrome via its remote-debugging port
chrome_options.add_experimental_option("debuggerAddress", "127.0.0.1:9222")
# driver = webdriver.Chrome(chrome_options=chrome_options)
# On selenium 3.141.0 / Python 3.8.1 the line above emits
# DeprecationWarning: use options instead of chrome_options
# so pass options= instead of chrome_options=
driver = webdriver.Chrome(options=chrome_options)
print(driver.title)

Approach 2: load your full Chrome profile

Visit Taobao beforehand, log in, and search for a product so the browser saves the session data.

from selenium import webdriver
option = webdriver.ChromeOptions()
p = r'C:\Users\Administrator\AppData\Local\Google\Chrome\User Data'
option.add_argument('--user-data-dir=' + p)  # point this at your own Chrome profile directory
driver = webdriver.Chrome(options=option)  # options=, not the deprecated chrome_options=

The page-number input box now stores the page in its value attribute, which has to be set via JavaScript:

        if page > 1:
            input = wait.until(
                EC.presence_of_element_located((By.CLASS_NAME, 'J_Input'))
            )
            submit = wait.until(
                EC.element_to_be_clickable((By.CLASS_NAME, 'J_Submit'))
            )
            # Set the node's value attribute via JavaScript
            browser.execute_script("arguments[0].setAttribute('value','{}');".format(str(page)), input)
            submit.click()

Image links are missing the http(s) scheme

The scraped image links are protocol-relative, e.g. 'image': '//g-search1.alicdn.com/img/bao/uploaded/i4/i1/2244854683/O1CN011kSrF0E2Qti40FF_!!2244854683.jpg'. Prepend the scheme when building the record: 'image': "https:" + item.find('.pic .img').attr('data-src')
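A small helper (sketch) that makes protocol-relative links absolute while leaving already-complete URLs alone:

```python
def absolutize(url, scheme='https'):
    """Prefix protocol-relative URLs (//host/path) with a scheme; pass others through."""
    if url and url.startswith('//'):
        return scheme + ':' + url
    return url

print(absolutize('//g-search1.alicdn.com/img/example.jpg'))
```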

Some attempts at initializing Selenium and logging into Taobao

After working through this chapter and experimenting myself, I found that Taobao's anti-Selenium measures are now very thorough. Against the slider captcha that often appears during login, the simplest known trick is to force window.navigator.webdriver to undefined, i.e. open Chrome in "developer" mode; in my tests this no longer works. Driving the slider with the ActionChains module also fails regularly, and afterwards even manual drags often fail. More discussion here: https://www.zhihu.com/question/285659525?sort=created. Below is a fairly simple login route: use Weibo login to sidestep the slider captcha entirely. Ideas for actually beating the slider are welcome!

import time

import requests
from selenium import webdriver

try:
    chrome_options = webdriver.ChromeOptions()
    # chrome_options.add_argument('--headless')
    # The next line launches Chrome without the "automation" switch
    chrome_options.add_experimental_option('excludeSwitches', ['enable-automation'])
    browser = webdriver.Chrome(options=chrome_options)
    browser.get("https://s.taobao.com/search?q=iPad")
    button = browser.find_element_by_class_name('login-switch')
    button.click()
    button = browser.find_element_by_class_name('weibo-login')
    button.click()
    user_name = browser.find_element_by_name('username')
    user_name.clear()
    user_name.send_keys('*****')  # Weibo username; must be linked to Taobao beforehand
    time.sleep(1)
    user_keys = browser.find_element_by_name('password')
    user_keys.clear()
    user_keys.send_keys('*****')  # Weibo password
    time.sleep(1)
    button = browser.find_element_by_class_name('W_btn_g')
    button.click()
    time.sleep(1)
    cookies = browser.get_cookies()
    ses = requests.Session()  # session that will carry the login state
    c = requests.cookies.RequestsCookieJar()
    for item in cookies:
        c.set(item["name"], item["value"])
    ses.cookies.update(c)  # copy the browser cookies into the session once, after the loop
    print('Login succeeded')
except Exception:
    print("Login failed")

How do I find the right tags for the needed fields via the browser's developer tools?

shop, location and deal use the same selectors as in the book, but the selectors for image, price and title no longer match. My own selectors are:

  • 'image': item.find('.J_ItemPic').attr('data-src'),

  • 'price': item.find('.price strong').text()

  • 'title': item.find('.baoyou').text()

Any tips on how to locate the right tags quickly and reliably?

Scraping Taobao product info now requires login; modify the code to log in by scanning the QR code first, then scrape

# Scrape Taobao products
import pymongo
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from pyquery import PyQuery as pq
from urllib.parse import quote

# browser = webdriver.Chrome()
# browser = webdriver.PhantomJS(service_args=SERVICE_ARGS)
MONGO_URL = 'localhost'
MONGO_DB = 'taobao'
MONGO_COLLECTION = 'products'

KEYWORD = 'macbook'
MAX_PAGE = 20


options = webdriver.ChromeOptions()
# options.add_argument('--headless')
browser = webdriver.Chrome(options=options)

wait = WebDriverWait(browser, 10)
client = pymongo.MongoClient(MONGO_URL)
db = client[MONGO_DB]


def index_page(page):
    """
    Scrape an index (search-results) page.
    :param page: the page number
    """
    print('Crawling page', page)
    try:
        url = 'https://s.taobao.com/search?q=' + quote(KEYWORD)
        browser.get(url)
        # The if block below performs the page jump
        if page > 1:
            input = wait.until(
                EC.presence_of_element_located((By.CSS_SELECTOR, '#mainsrp-pager div.form > input')))
            submit = wait.until(
                EC.element_to_be_clickable((By.CSS_SELECTOR, '#mainsrp-pager div.form > span.btn.J_Submit')))
            input.clear()
            input.send_keys(page)
            submit.click()
        # Wait for the jump to succeed
        wait.until(
            EC.text_to_be_present_in_element((By.CSS_SELECTOR, '#mainsrp-pager li.item.active > span'), str(page)))
        # Wait for the products to load
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.m-itemlist .items .item')))
        get_products()
    except TimeoutException:
        index_page(page)


def get_products():
    """
    Extract product data from the current page.
    """
    html = browser.page_source
    doc = pq(html)
    items = doc('#mainsrp-itemlist .m-itemlist .items .item').items()
    for item in items:
        product = {
            'image': 'https:'+item.find('.pic .img').attr('data-src'),
            'price': item.find('.price').text(),
            'deal': item.find('.deal-cnt').text(),
            'title': item.find('.title').text(),
            'shop': item.find('.shop').text(),
            'location': item.find('.location').text()
        }
        print(product)
        save_to_mongo(product)


def save_to_mongo(result):
    """
    Save a record to MongoDB.
    :param result: the product dict
    """
    try:
        if db[MONGO_COLLECTION].insert_one(result):
            print('Saved to MongoDB')
    except Exception:
        print('Failed to save to MongoDB')


def main():
    """
    Iterate over every page.
    """
    for i in range(1, MAX_PAGE + 1):
        index_page(i)
    browser.close()


if __name__ == '__main__':
    main()

The key change is here: show the browser window so you can log in first, then scrape:

options = webdriver.ChromeOptions()
# options.add_argument('--headless')
browser = webdriver.Chrome(options=options)

Also, merge the contents of config.py into this script.

Stuck repeatedly printing "Crawling page 1"

No error is raised; after running for a long time it still just prints "Crawling page 1". I only want to print the data, so I commented out the MongoDB storage calls.
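One likely cause: in the script above, index_page() calls itself on every TimeoutException, so if the expected elements never appear (e.g. a login wall or captcha is shown instead), it silently retries page 1 forever. A sketch of a retry wrapper that caps the attempts and surfaces the failure (a hypothetical helper; in the real script you would pass exc=TimeoutException from selenium.common.exceptions):

```python
def call_with_retry(fetch, page, max_retries=3, exc=Exception):
    """Call fetch(page); retry on the given exception type, then fail loudly."""
    for attempt in range(1, max_retries + 1):
        try:
            return fetch(page)
        except exc:
            print(f'Timeout on page {page}, attempt {attempt}/{max_retries}')
    raise RuntimeError(f'Page {page} failed after {max_retries} retries; '
                       'check for a login wall or captcha')
```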
