brandicted / scrapy-webdriver
License: MIT License
# Import paths assume the contrib-era Scrapy these reports use.
from scrapy import log
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy_webdriver.http import WebdriverRequest

class MySpider(CrawlSpider):
    start_urls = [
        "http://www.example.com",
    ]
    rules = (
        Rule(
            LxmlLinkExtractor(
                allow=[r'\w+/\d+$', r'\w+/\d+-p\d+$'],
            ),
            follow=True,
        ),
        Rule(
            LxmlLinkExtractor(
                # Note the trailing comma: without it this is a plain
                # string, not a tuple, and the dot must be escaped.
                allow=(r'\d+\.html$',),
            ),
            'parse_action',
        ),
    )

    def parse_action(self, response):
        yield WebdriverRequest(response.url,
                               callback=self.parse_item)

    def parse_item(self, response):
        self.log('received for %s' % response.url, level=log.WARNING)
The PhantomJS webdriver does not perform an ActionChains sequence when doing a WebdriverActionRequest. See http://stackoverflow.com/questions/16744038/python-bindings-to-selenium-webdriver-actionchain-not-executing-in-phantomjs for a reproducible example.
I get the following stacktrace when I run the pip install given in the README.md.
>pip install https://github.com/sosign/scrapy-webdriver/archive/master.zip
Collecting https://github.com/sosign/scrapy-webdriver/archive/master.zip
  Downloading https://github.com/sosign/scrapy-webdriver/archive/master.zip
     \ 20kB 2.6MB/s
  Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "C:\Users\Tobias\AppData\Local\Temp\pip-x5s6n50e-build\setup.py", line 8, in <module>
        from scrapy_webdriver import metadata
      File "C:\Users\Tobias\AppData\Local\Temp\pip-x5s6n50e-build\scrapy_webdriver\__init__.py", line 3, in <module>
        import metadata
    ModuleNotFoundError: No module named 'metadata'
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in C:\Users\Tobias\AppData\Local\Temp\pip-x5s6n50e-build\
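This is a Python 2-style implicit relative import failing under Python 3: PEP 328 removed the implicit form, so `import metadata` inside a package no longer resolves to the sibling module. A likely fix, assuming the package layout shown in the traceback, is to make the import explicit in `scrapy_webdriver/__init__.py`. A self-contained demo (`demo_pkg` is a made-up stand-in package):

```python
import os
import sys
import tempfile

# Build a minimal package on disk whose __init__.py uses the explicit
# relative import that works on both Python 2.6+ and Python 3.
root = tempfile.mkdtemp()
pkg = os.path.join(root, "demo_pkg")
os.makedirs(pkg)
with open(os.path.join(pkg, "metadata.py"), "w") as f:
    f.write("version = '0.1'\n")
with open(os.path.join(pkg, "__init__.py"), "w") as f:
    # The Python 3-safe form; `import metadata` would raise
    # ModuleNotFoundError here, exactly as in the report above.
    f.write("from . import metadata\n")

sys.path.insert(0, root)
import demo_pkg
print(demo_pkg.metadata.version)
```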
My environment is below:
conda list
Is there support for:
- page screenshots?
- navigation events?
- executeScript()?
2013-05-11 10:55:01+0800 [xxx] ERROR: Error downloading <GET http://xxxxx.com/item.htm?id=151215152>: unbound method get() must be called with WebDriver instance as first argument (got str instance instead)
I'm not sure why this is happening yet, any clues? Incorrect usage?
I'm using it as follows:
import re

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy_webdriver.http import WebdriverRequest

def parse(self, response):
    """The parse method gets called for any response that doesn't
    have a callback set; parse is the default callback. This is
    where we extract items and decide which urls to follow.
    """
    hxs = HtmlXPathSelector(response)
    for url in hxs.select('//a[starts-with(@href, "http")]'
                          '/@href').extract():
        # Match item.htm?id=... and item.html?id=... urls.
        if re.search(r'item\.html?\?id=\d+', url):
            yield WebdriverRequest(url, callback=self.parse_item)
        else:
            yield Request(url, callback=self.parse)
It seems the webdriver isn't initialized; I'll dig into the code a bit. Are you still using / maintaining this?
Hi,
Thanks for your work!
I have a question: how do I get access to the webdriver from my own downloader middleware? I want to take a screenshot of the page when particular information is present.
Thanks,
Alexander
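One way this could look, assuming the WebdriverResponse keeps a reference to the underlying Selenium driver on a `webdriver` attribute (an assumption worth checking against the library source); the middleware name, trigger string, and output path are all made up:

```python
class ScreenshotMiddleware(object):
    """Hypothetical downloader middleware sketch: snapshot the page
    when a marker string appears in the response body."""

    def process_response(self, request, response, spider):
        # Only WebdriverResponses are assumed to carry a live driver.
        driver = getattr(response, 'webdriver', None)
        if driver is not None and 'out of stock' in response.body:
            # Selenium WebDriver API: save a PNG of the current page.
            driver.get_screenshot_as_file('/tmp/screenshot.png')
        return response
```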
If an exception is raised in the parse method of a WebdriverResponse/WebdriverRequest, the whole spider closes/exits and doesn't continue.
Steps to reproduce:
In any of your parse methods that handle WebdriverResponses, raise an exception.
Current result:
Scrapy stops crawling.
Expected result:
Scrapy continues crawling the next requests / urls.
When you raise an error while parsing a normal Scrapy Request / Response, crawling seems to just continue. I only did some quick testing on this, so I may be wrong. This is a related error log:
2013-05-18 00:10:43+0800 [xxxxxx] ERROR: Spider error processing <GET http://item.xxxxx.com/>
    Traceback (most recent call last):
      File "/usr/lib/python2.7/site-packages/twisted/internet/base.py", line 824, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/usr/lib/python2.7/site-packages/twisted/internet/task.py", line 607, in _tick
        taskObj._oneWorkUnit()
      File "/usr/lib/python2.7/site-packages/twisted/internet/task.py", line 484, in _oneWorkUnit
        result = next(self._iterator)
      File "/home/samos/.virtualenvs/scrapy/lib/python2.7/site-packages/scrapy/utils/defer.py", line 57, in <genexpr>
        work = (callable(elem, *args, **named) for elem in iterable)
    --- <exception caught here> ---
      File "/home/samos/.virtualenvs/scrapy/lib/python2.7/site-packages/scrapy/utils/defer.py", line 96, in iter_errback
        yield it.next()
      File "/home/samos/.virtualenvs/scrapy/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/offsite.py", line 8, in process_spider_output
        for x in result:
      File "/home/samos/.virtualenvs/scrapy/lib/python2.7/site-packages/scrapy_webdriver/middlewares.py", line 36, in process_spider_output
        for item_or_request in self._process_requests(result):
      File "/home/samos/.virtualenvs/scrapy/lib/python2.7/site-packages/scrapy_webdriver/middlewares.py", line 51, in _process_requests
        for request in iter(items_or_requests):
      File "/home/samos/.virtualenvs/scrapy/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/referer.py", line 2, in <genexpr>
        return (_set_referer(r) for r in result or ())
      File "/home/samos/.virtualenvs/scrapy/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/urllength.py", line 33, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/home/samos/.virtualenvs/scrapy/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/depth.py", line 50, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/home/samos/workspace/alex-scrapy/crawler/spiders/xxx_spider.py", line 50, in parse_item
        raise Exception("test")
    exceptions.Exception: test
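The expected behavior can be sketched as a defensive wrapper around the callback's result iteration, so one failing item is logged instead of killing the crawl (this is only an illustration, not the project's code; Scrapy itself handles this via process_spider_exception in spider middleware):

```python
def safe_iter(result, logger=print):
    """Yield items from a spider-callback result iterator, logging
    exceptions and continuing with the next item instead of letting
    one failure propagate and stop the crawl."""
    it = iter(result)
    while True:
        try:
            yield next(it)
        except StopIteration:
            return
        except Exception as exc:
            # Log and keep going; a plain generator would be dead
            # after raising, so this only helps resumable iterators.
            logger('callback error: %r' % (exc,))
```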
Why are you including distribute 0.6.27 with this package? It makes no sense, and it breaks pip and fabric after this package is installed. For example:
$ fab
Traceback (most recent call last):
  File "/usr/local/bin/fab", line 5, in <module>
    from pkg_resources import load_entry_point
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 2735, in <module>
    return mkstemp(*args,**kw)
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 690, in require
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 588, in resolve
    Unless `replace_conflicting=True`, raises a VersionConflict exception if
pkg_resources.DistributionNotFound: paramiko>=1.10
$ pip install paramiko
Downloading/unpacking paramiko
Downloading paramiko-1.15.1-py2.py3-none-any.whl (165kB): 165kB downloaded
Cleaning up...
Exception:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/pip/basecommand.py", line 122, in main
    status = self.run(options, args)
  File "/usr/lib/python2.7/dist-packages/pip/commands/install.py", line 278, in run
    requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle)
  File "/usr/lib/python2.7/dist-packages/pip/req.py", line 1259, in prepare_files
    )[0]
IndexError: list index out of range
Storing debug log for failure in /root/.pip/pip.log
These all seem like bugs caused by the old version of distribute bundled with this package. I don't see a reason to ever ship distribute inside your package, where it overwrites the user's own copy, but perhaps I'm missing something.
Just to let you know that 9671684 broke compatibility with Scrapy versions lower than 0.18. Using scrapy-webdriver on that changeset with Scrapy 0.16.5 throws the error:
ImportError: Error loading object 'scrapy.core.downloader.handlers.http10.HTTP10DownloadHandler': No module named http10
I have reverted to using 079ab6a which works correctly.
You should probably specify in the README which Scrapy version scrapy-webdriver supports, although a fix for this should be straightforward.
Also, consider doing a version bump in these cases; uploading the package to PyPI would also make things easier.
Thanks for your work!
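A straightforward compatibility gate could pick the download handler path from the installed Scrapy version. The dotted paths and the 0.18 cutoff come from the errors above; the helper names are made up, so treat this as a sketch rather than the project's fix:

```python
def parse_version(v):
    """Turn a dotted version string like '0.16.5' into a comparable
    tuple of ints, e.g. (0, 16, 5)."""
    return tuple(int(x) for x in v.split('.'))

def pick_handler_path(scrapy_version):
    """Return the dotted path of the download handler to load.
    The http10 module only exists in Scrapy >= 0.18, as the
    ImportError under 0.16.5 shows."""
    if parse_version(scrapy_version) >= (0, 18):
        return 'scrapy.core.downloader.handlers.http10.HTTP10DownloadHandler'
    return 'scrapy.core.downloader.handlers.http.HttpDownloadHandler'
```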
I saw the request is replaced with dont_filter=True; if I remove that, the spider just stops when it reaches the same url.
I need to use the offsite middleware though, so any thoughts?
I will do some hacking on a total rewrite that needs no spider middleware, only a downloader middleware or a normal downloader. Starting to understand this stuff a little, hehe.
I'm currently seeing that it's stuck on downloading for a long time; could it be that the request timed out, so it won't continue? Are requests currently not concurrent because of the queues? Does it only take them out of the queue one by one?
2013-05-14 13:46:23+0800 [scrapy] DEBUG: Downloading http://xxxxl.com/item.html with webdriver
2013-05-14 13:46:32+0800 [xxx] INFO: Crawled 23 pages (at 23 pages/min), scraped 9 items (at 9 items/min)
2013-05-14 13:47:32+0800 [xxx] INFO: Crawled 23 pages (at 0 pages/min), scraped 9 items (at 0 items/min)
2013-05-14 13:48:32+0800 [xxx] INFO: Crawled 23 pages (at 0 pages/min), scraped 9 items (at 0 items/min)
2013-05-14 13:49:32+0800 [xxx] INFO: Crawled 23 pages (at 0 pages/min), scraped 9 items (at 0 items/min)
2013-05-14 13:50:32+0800 [xx] INFO: Crawled 23 pages (at 0 pages/min), scraped 9 items (at 0 items/min)
Feature description:
Add the ability to spawn multiple webdrivers so we can process requests concurrently.
For this we need an extra option, a maximum number of webdrivers, as the pool shouldn't grow indefinitely.
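A bounded-pool sketch of that option (the class and parameter names are made up, not the library's API):

```python
import queue

class WebdriverPool(object):
    """Bounded pool of webdriver instances (hypothetical sketch).

    `factory` creates a new driver on demand; at most `max_drivers`
    are ever created, so the pool cannot grow indefinitely.
    """

    def __init__(self, factory, max_drivers=4):
        self.factory = factory
        self.max_drivers = max_drivers
        self.created = 0
        self.idle = queue.Queue()

    def acquire(self):
        # Reuse an idle driver if one exists; otherwise create a new
        # one until the cap is hit, then block until one is released.
        try:
            return self.idle.get_nowait()
        except queue.Empty:
            if self.created < self.max_drivers:
                self.created += 1
                return self.factory()
            return self.idle.get()

    def release(self, driver):
        self.idle.put(driver)
```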
It probably got stuck on downloading because PhantomJS crashed:
[DEBUG - 2013-05-18T04:28:00.536Z] Session [399aee20-bf06-11e2-a1b3-1ff9fbb8ef48] - _execFuncAndWaitForLoadDecorator - Page Loading in Session: true
[DEBUG - 2013-05-18T04:28:00.637Z] Session [399aee20-bf06-11e2-a1b3-1ff9fbb8ef48] - _execFuncAndWaitForLoadDecorator - Page Loading in Session: true
ExceptionHandler::GenerateDump waitpid failed:No child processes
PhantomJS has crashed. Please read the crash reporting guide at https://github.com/ariya/phantomjs/wiki/Crash-Reporting and file a bug report at https://github.com/ariya/phantomjs/issues/new with the crash dump file attached: /tmp/75f0d88c-1f16-3dd6-4a2892d0-687e48d0.dmp
So we may also need a way to check whether PhantomJS is still responding, and if not, automatically restart the webdriver/PhantomJS process.
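A cheap liveness probe could look like this (a sketch; any WebDriver command raises once the browser process has died, so reading current_url doubles as a health check, and the restart policy itself is left out):

```python
def driver_is_alive(driver):
    """Best-effort check that the webdriver's browser process is
    still responding. Reading `current_url` makes a round-trip to
    the browser; a crashed PhantomJS raises instead of answering.
    The caller is expected to quit and respawn the driver on False."""
    try:
        driver.current_url  # cheap round-trip command
        return True
    except Exception:
        return False
```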