Taobao Product Spider by Selenium
Selenium Demo of Taobao Product
Updated 2020/2/10: for the login-required problem, see Issue #15.
[13944:14208:0409/001345.567:ERROR:platform_sensor_reader_win.cc(243)] NOT IMPLEMENTED
[13944:14208:0409/001346.349:ERROR:platform_sensor_reader_win.cc(243)] NOT IMPLEMENTED
[13944:14208:0409/001346.967:ERROR:platform_sensor_reader_win.cc(243)] NOT IMPLEMENTED
[13944:14208:0409/001347.503:ERROR:platform_sensor_reader_win.cc(243)] NOT IMPLEMENTED
[13944:14208:0409/001348.176:ERROR:platform_sensor_reader_win.cc(243)] NOT IMPLEMENTED
[13944:14208:0409/001348.774:ERROR:platform_sensor_reader_win.cc(243)] NOT IMPLEMENTED
[13944:14208:0409/001349.372:ERROR:platform_sensor_reader_win.cc(243)] NOT IMPLEMENTED
[13944:14208:0409/001349.942:ERROR:platform_sensor_reader_win.cc(243)] NOT IMPLEMENTED
[13944:14208:0409/001350.609:ERROR:platform_sensor_reader_win.cc(243)] NOT IMPLEMENTED
[13944:14208:0409/001351.216:ERROR:platform_sensor_reader_win.cc(243)] NOT IMPLEMENTED
I get redirected from the search page to the login page:
https://login.taobao.com/member/login.jhtml?redirectURL=http://s.taobao.com/search?q=iphone
What do I need to do to avoid this?
from config import *
ModuleNotFoundError: No module named 'config'
Is this a library that wasn't installed properly?
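`config` is not a pip package but a local `config.py` file from the book's repo, which must sit next to the script. A minimal sketch, assuming the constants the spider in this thread actually uses (the values here are examples):

```python
# config.py -- a local settings module, not an installable library.
# Constant names mirror those used by the spider; values are examples.
MONGO_URL = 'localhost'
MONGO_DB = 'taobao'
MONGO_COLLECTION = 'products'
KEYWORD = 'iphone'
MAX_PAGE = 100
# Only needed by the (deprecated) PhantomJS variant of the spider
SERVICE_ARGS = ['--load-images=false', '--disk-cache=true']
```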
Taobao's anti-crawling measures have changed as well: with Selenium you can now only scrape the first page; turning the page is detected and the operation gets blocked.
/usr/local/bin/python3 /Users/duoluoluolin/PycharmProjects/untitled/随便写写/动态渲染页面爬取/selenium_实战.py
Scraping page 1
Scraping page 1
Traceback (most recent call last):
File "/Users/duoluoluolin/PycharmProjects/untitled/随便写写/动态渲染页面爬取/selenium_实战.py", line 34, in index_page
EC.text_to_be_present_in_element((By.CSS_SELECTOR,'#mainsrp-pager li.item.active > span'),str(page)))
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/selenium/webdriver/support/wait.py", line 80, in until
raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
As of November 29, 2020: the code has clearly been revised several times and is not complete, so I wouldn't recommend beginners read it as-is, but the crawling logic and way of thinking are worth studying.
Open cmd and run: chrome.exe --remote-debugging-port=9222 --user-data-dir="C:\selenum\AutomationProfile"
This launches Chrome with a remote-debugging port and a fresh profile directory, leaving your normal profile untouched.
Use this newly opened browser to visit Taobao, log in, and search for a product.
Then run the following code (with chromedriver placed in the /Python/Script directory):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Initialize the browser options object
chrome_options = Options()
# Attach to the already-running Chrome through its debugging port
chrome_options.add_experimental_option("debuggerAddress", "127.0.0.1:9222")
# driver = webdriver.Chrome(chrome_options=chrome_options)
# Selenium 3.141.0 on Python 3.8.1 raises a deprecation warning:
# DeprecationWarning: use options instead of chrome_options
# so pass options= instead of chrome_options=
driver = webdriver.Chrome(options=chrome_options)
print(driver.title)
Visit Taobao beforehand, log in, and search for a product so the browser saves the relevant session data.

from selenium import webdriver

option = webdriver.ChromeOptions()
p = r'C:\Users\Administrator\AppData\Local\Google\Chrome\User Data'
option.add_argument('--user-data-dir=' + p)  # point this at your own user-data directory
driver = webdriver.Chrome(options=option)
if page > 1:
    input = wait.until(
        EC.presence_of_element_located((By.CLASS_NAME, 'J_Input'))
    )
    submit = wait.until(
        EC.element_to_be_clickable((By.CLASS_NAME, 'J_Submit'))
    )
    # Use JavaScript to modify the node's value attribute
    browser.execute_script("arguments[0].setAttribute('value','{}');".format(str(page)), input)
    submit.click()
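One caveat with the snippet above: `setAttribute('value', …)` only changes an `<input>`'s default value, whereas the live `value` property is what the form actually submits. A hedged alternative (a sketch; `set_input_value` is a name I made up):

```python
def set_input_value(driver, element, value):
    # Assign the live .value property instead of the 'value' attribute:
    # setAttribute only changes the element's default value, while
    # .value is what the page reads when the form is submitted.
    driver.execute_script("arguments[0].value = arguments[1];", element, str(value))
```

Usage would then be `set_input_value(browser, input, page)` followed by `submit.click()`.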
The image links come back without a scheme, e.g. 'image': '//g-search1.alicdn.com/img/bao/uploaded/i4/i1/2244854683/O1CN011kSrF0E2Qti40FF_!!2244854683.jpg'. Change the line to: 'image': "https:" + item.find('.pic .img').attr('data-src')
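A small helper makes the same fix safely, guarding against links that already carry a scheme (a sketch; `normalize_img_url` is my own name):

```python
def normalize_img_url(url):
    # Taobao returns protocol-relative links ('//g-search1.alicdn.com/...');
    # prepend a scheme only when one is actually missing.
    if url and url.startswith('//'):
        return 'https:' + url
    return url
```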
It keeps showing "Scraping page 1" forever and never stops. What's the problem?
After studying this chapter of Cui's book and trying it out, I found that Taobao's anti-Selenium measures are now very thorough. For the slider captcha that often appears during Taobao login, the simplest known workaround is to change window.navigator.webdriver to undefined, i.e. open Chrome in developer mode; I tried this and it did not work. Driving the slider via the ActionChains module also fails frequently, and afterwards even my manual attempts at the slider kept failing. More discussion here: https://www.zhihu.com/question/285659525?sort=created. Below is a fairly simple login approach: log in via Weibo to sidestep the possible slider captcha. I'd also welcome ideas from others on actually solving the slider captcha!
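For reference, the navigator.webdriver trick mentioned above is usually injected before any page script runs, via the Chrome DevTools Protocol. A sketch, assuming a recent Selenium/chromedriver combination that exposes execute_cdp_cmd (and noting that, as above, this alone reportedly no longer fools Taobao):

```python
STEALTH_JS = (
    "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
)

def apply_stealth(driver):
    # Register the script to run in every new document, before page scripts,
    # so window.navigator.webdriver reads as undefined from the very start.
    driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument", {"source": STEALTH_JS}
    )
```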
import time

import requests
from selenium import webdriver

try:
    chrome_options = webdriver.ChromeOptions()
    # chrome_options.add_argument('--headless')
    # The next line opens Chrome in "developer mode" (hides the automation banner)
    chrome_options.add_experimental_option('excludeSwitches', ['enable-automation'])
    browser = webdriver.Chrome(options=chrome_options)
    browser.get("https://s.taobao.com/search?q=iPad")
    button = browser.find_element_by_class_name('login-switch')
    button.click()
    button = browser.find_element_by_class_name('weibo-login')
    button.click()
    user_name = browser.find_element_by_name('username')
    user_name.clear()
    user_name.send_keys('*****')  # Weibo username; the account must already be bound to Taobao
    time.sleep(1)
    user_keys = browser.find_element_by_name('password')
    user_keys.clear()
    user_keys.send_keys('*****')  # Weibo password
    time.sleep(1)
    button = browser.find_element_by_class_name('W_btn_g')
    button.click()
    time.sleep(1)
    cookies = browser.get_cookies()
    ses = requests.Session()  # keeps the logged-in state
    c = requests.cookies.RequestsCookieJar()
    for item in cookies:
        c.set(item["name"], item["value"])
    ses.cookies.update(c)
    time.sleep(1)
    print('Login succeeded')
except Exception:
    print("Login failed")
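Since Session.cookies.update() also accepts a plain dict, the cookie transfer from Selenium can be reduced to a small helper (a sketch; `cookies_to_dict` is a name I made up):

```python
def cookies_to_dict(selenium_cookies):
    # browser.get_cookies() returns a list of dicts with 'name'/'value'
    # (plus domain, path, etc.); requests only needs name -> value here.
    return {c['name']: c['value'] for c in selenium_cookies}
```

Usage would be `ses.cookies.update(cookies_to_dict(browser.get_cookies()))`.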
shop, location and deal use the same locators as in the book, but my locators for image, price and title differ. Here is what I wrote:
'image': item.find('.J_ItemPic').attr('data-src'),
'price': item.find('.price strong').text(),
'title': item.find('.baoyou').text(),
Could anyone advise on how to find the right selectors for the needed information quickly and accurately?
# Taobao product spider
import pymongo
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from pyquery import PyQuery as pq
from urllib.parse import quote

# browser = webdriver.Chrome()
# browser = webdriver.PhantomJS(service_args=SERVICE_ARGS)
MONGO_URL = 'localhost'
MONGO_DB = 'taobao'
MONGO_COLLECTION = 'products'
KEYWORD = 'macbook'
MAX_PAGE = 20

options = webdriver.ChromeOptions()
# options.add_argument('--headless')
browser = webdriver.Chrome(options=options)
wait = WebDriverWait(browser, 10)
client = pymongo.MongoClient(MONGO_URL)
db = client[MONGO_DB]


def index_page(page):
    """
    Scrape one index page.
    :param page: page number
    """
    print('Scraping page', page)
    try:
        url = 'https://s.taobao.com/search?q=' + quote(KEYWORD)
        browser.get(url)
        # The if block below performs the page jump
        if page > 1:
            input = wait.until(
                EC.presence_of_element_located((By.CSS_SELECTOR, '#mainsrp-pager div.form > input')))
            submit = wait.until(
                EC.element_to_be_clickable((By.CSS_SELECTOR, '#mainsrp-pager div.form > span.btn.J_Submit')))
            input.clear()
            input.send_keys(page)
            submit.click()
        # Wait for the jump to finish
        wait.until(
            EC.text_to_be_present_in_element((By.CSS_SELECTOR, '#mainsrp-pager li.item.active > span'), str(page)))
        # Wait for the products to load
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.m-itemlist .items .item')))
        get_products()
    except TimeoutException:
        index_page(page)


def get_products():
    """
    Extract the product data.
    """
    html = browser.page_source
    doc = pq(html)
    items = doc('#mainsrp-itemlist .m-itemlist .items .item').items()
    for item in items:
        product = {
            'image': 'https:' + item.find('.pic .img').attr('data-src'),
            'price': item.find('.price').text(),
            'deal': item.find('.deal-cnt').text(),
            'title': item.find('.title').text(),
            'shop': item.find('.shop').text(),
            'location': item.find('.location').text()
        }
        print(product)
        save_to_mongo(product)


def save_to_mongo(result):
    """
    Save a result to MongoDB.
    :param result: the result dict
    """
    try:
        if db[MONGO_COLLECTION].insert_one(result):
            print('Saved to MongoDB')
    except Exception:
        print('Failed to save to MongoDB')


def main():
    """
    Iterate over every page.
    """
    for i in range(1, MAX_PAGE + 1):
        index_page(i)
    browser.close()


if __name__ == '__main__':
    main()
The key change is here: show the browser window, log in first, then crawl.

options = webdriver.ChromeOptions()
# options.add_argument('--headless')
browser = webdriver.Chrome(options=options)

Also merge the contents of config.py into this file.
There's no error either; after running for a long time it still just prints "Scraping page 1". Since I only wanted to print the data, I commented out the MongoDB storage calls.
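One likely cause of the endless "Scraping page 1" output is that index_page() calls itself on every TimeoutException with no upper bound, so a page that never finishes loading (e.g. because of a login wall) retries forever. A bounded-retry sketch, with a stand-in TimeoutException so it is self-contained (the names here are mine):

```python
class TimeoutException(Exception):
    """Stand-in for selenium.common.exceptions.TimeoutException."""

def scrape_with_retries(scrape_page, page, retries=3):
    # Try the page a fixed number of times, then give up instead of
    # recursing without bound the way the original index_page() does.
    for attempt in range(1, retries + 1):
        try:
            return scrape_page(page)
        except TimeoutException:
            print('Timeout on page', page, '- attempt', attempt, 'of', retries)
    print('Giving up on page', page)
    return None
```

In main(), `index_page(i)` would become `scrape_with_retries(index_page, i)`, with the recursive call inside index_page removed.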