brandicted / scrapy-webdriver
License: MIT License
# Import paths assume the contrib-era Scrapy these reports use.
from scrapy import log
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy_webdriver.http import WebdriverRequest

class MySpider(CrawlSpider):
    start_urls = [
        "http://www.example.com",
    ]
    rules = (
        Rule(
            LxmlLinkExtractor(
                allow=[r'\w+/\d+$', r'\w+/\d+-p\d+$'],
            ),
            follow=True,
        ),
        Rule(
            LxmlLinkExtractor(
                # Note the trailing comma: without it this is a plain
                # string, not a tuple, and the dot must be escaped.
                allow=(r'\d+\.html$',),
            ),
            'parse_action',
        ),
    )

    def parse_action(self, response):
        yield WebdriverRequest(response.url,
                               callback=self.parse_item)

    def parse_item(self, response):
        self.log('received for %s' % response.url, level=log.WARNING)
The PhantomJS webdriver does not perform an ActionChains sequence when doing a WebdriverActionRequest. See http://stackoverflow.com/questions/16744038/python-bindings-to-selenium-webdriver-actionchain-not-executing-in-phantomjs for a reproducible example.
I get the following stacktrace when I run the pip install given in the README.md.
>pip install https://github.com/sosign/scrapy-webdriver/archive/master.zip
Collecting https://github.com/sosign/scrapy-webdriver/archive/master.zip
  Downloading https://github.com/sosign/scrapy-webdriver/archive/master.zip
     \ 20kB 2.6MB/s
  Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "C:\Users\Tobias\AppData\Local\Temp\pip-x5s6n50e-build\setup.py", line 8, in <module>
        from scrapy_webdriver import metadata
      File "C:\Users\Tobias\AppData\Local\Temp\pip-x5s6n50e-build\scrapy_webdriver\__init__.py", line 3, in <module>
        import metadata
    ModuleNotFoundError: No module named 'metadata'
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in C:\Users\Tobias\AppData\Local\Temp\pip-x5s6n50e-build\
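This is a Python 2-style implicit relative import failing under Python 3: PEP 328 removed the implicit form, so `import metadata` inside a package no longer resolves to the sibling module. A likely fix, assuming the package layout shown in the traceback, is to make the import explicit in `scrapy_webdriver/__init__.py`. A self-contained demo (`demo_pkg` is a made-up stand-in package):

```python
import os
import sys
import tempfile

# Build a minimal package on disk whose __init__.py uses the explicit
# relative import that works on both Python 2.6+ and Python 3.
root = tempfile.mkdtemp()
pkg = os.path.join(root, "demo_pkg")
os.makedirs(pkg)
with open(os.path.join(pkg, "metadata.py"), "w") as f:
    f.write("version = '0.1'\n")
with open(os.path.join(pkg, "__init__.py"), "w") as f:
    # The Python 3-safe form; `import metadata` would raise
    # ModuleNotFoundError here, exactly as in the report above.
    f.write("from . import metadata\n")

sys.path.insert(0, root)
import demo_pkg
print(demo_pkg.metadata.version)
```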
My environment is below:
conda list
Is there support for:
- page screenshots?
- navigation events?
- executeScript()?
2013-05-11 10:55:01+0800 [xxx] ERROR: Error downloading <GET http://xxxxx.com/item.htm?id=151215152>: unbound method get() must be called with WebDriver instance as first argument (got str instance instead)
I'm not sure why this is happening yet, any clues? Incorrect usage?
I'm using it as follows:
import re

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy_webdriver.http import WebdriverRequest

def parse(self, response):
    """The parse method gets called for any response that doesn't
    have a callback set; parse is the default callback. This is
    where we extract items and decide which urls to follow.
    """
    hxs = HtmlXPathSelector(response)
    for url in hxs.select('//a[starts-with(@href, "http")]'
                          '/@href').extract():
        # Match item.htm?id=... and item.html?id=... urls.
        if re.search(r'item\.html?\?id=\d+', url):
            yield WebdriverRequest(url, callback=self.parse_item)
        else:
            yield Request(url, callback=self.parse)
It seems the webdriver isn't initialized; I'll dig into the code a bit. Are you still using / maintaining this?
Hi,
Thanks for your work!
I have a question: how do I get access to the webdriver from my own downloader middleware? I want to take a screenshot of the page when particular information is present.
Thanks,
Alexander
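One way this could look, assuming the WebdriverResponse keeps a reference to the underlying Selenium driver on a `webdriver` attribute (an assumption worth checking against the library source); the middleware name, trigger string, and output path are all made up:

```python
class ScreenshotMiddleware(object):
    """Hypothetical downloader middleware sketch: snapshot the page
    when a marker string appears in the response body."""

    def process_response(self, request, response, spider):
        # Only WebdriverResponses are assumed to carry a live driver.
        driver = getattr(response, 'webdriver', None)
        if driver is not None and 'out of stock' in response.body:
            # Selenium WebDriver API: save a PNG of the current page.
            driver.get_screenshot_as_file('/tmp/screenshot.png')
        return response
```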
If an exception is raised in the parse method of a WebdriverResponse/WebdriverRequest, the whole spider closes/exits and doesn't continue.
Steps to reproduce:
In any of your parse methods that handle WebdriverResponses, raise an exception.
Current result:
Scrapy stops crawling.
Expected result:
Scrapy continues crawling the next requests / urls.
When you raise an error while parsing a normal Scrapy Request / Response, crawling seems to just continue. I only did some quick testing on this, so I may be wrong. This is a related error log:
2013-05-18 00:10:43+0800 [xxxxxx] ERROR: Spider error processing <GET http://item.xxxxx.com/>
    Traceback (most recent call last):
      File "/usr/lib/python2.7/site-packages/twisted/internet/base.py", line 824, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/usr/lib/python2.7/site-packages/twisted/internet/task.py", line 607, in _tick
        taskObj._oneWorkUnit()
      File "/usr/lib/python2.7/site-packages/twisted/internet/task.py", line 484, in _oneWorkUnit
        result = next(self._iterator)
      File "/home/samos/.virtualenvs/scrapy/lib/python2.7/site-packages/scrapy/utils/defer.py", line 57, in <genexpr>
        work = (callable(elem, *args, **named) for elem in iterable)
    --- <exception caught here> ---
      File "/home/samos/.virtualenvs/scrapy/lib/python2.7/site-packages/scrapy/utils/defer.py", line 96, in iter_errback
        yield it.next()
      File "/home/samos/.virtualenvs/scrapy/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/offsite.py", line 8, in process_spider_output
        for x in result:
      File "/home/samos/.virtualenvs/scrapy/lib/python2.7/site-packages/scrapy_webdriver/middlewares.py", line 36, in process_spider_output
        for item_or_request in self._process_requests(result):
      File "/home/samos/.virtualenvs/scrapy/lib/python2.7/site-packages/scrapy_webdriver/middlewares.py", line 51, in _process_requests
        for request in iter(items_or_requests):
      File "/home/samos/.virtualenvs/scrapy/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/referer.py", line 2, in <genexpr>
        return (_set_referer(r) for r in result or ())
      File "/home/samos/.virtualenvs/scrapy/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/urllength.py", line 33, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/home/samos/.virtualenvs/scrapy/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/depth.py", line 50, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/home/samos/workspace/alex-scrapy/crawler/spiders/xxx_spider.py", line 50, in parse_item
        raise Exception("test")
    exceptions.Exception: test
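The expected behavior can be sketched as a defensive wrapper around the callback's result iteration, so one failing item is logged instead of killing the crawl (this is only an illustration, not the project's code; Scrapy itself handles this via process_spider_exception in spider middleware):

```python
def safe_iter(result, logger=print):
    """Yield items from a spider-callback result iterator, logging
    exceptions and continuing with the next item instead of letting
    one failure propagate and stop the crawl."""
    it = iter(result)
    while True:
        try:
            yield next(it)
        except StopIteration:
            return
        except Exception as exc:
            # Log and keep going; a plain generator would be dead
            # after raising, so this only helps resumable iterators.
            logger('callback error: %r' % (exc,))
```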
Why are you including distribute 0.6.27 with this package? It makes no sense, and it breaks pip and fabric after this package is installed. For example:
$ fab
Traceback (most recent call last):
  File "/usr/local/bin/fab", line 5, in <module>
    from pkg_resources import load_entry_point
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 2735, in <module>
    return mkstemp(*args,**kw)
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 690, in require
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 588, in resolve
    Unless `replace_conflicting=True`, raises a VersionConflict exception if
pkg_resources.DistributionNotFound: paramiko>=1.10
$ pip install paramiko
Downloading/unpacking paramiko
Downloading paramiko-1.15.1-py2.py3-none-any.whl (165kB): 165kB downloaded
Cleaning up...
Exception:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/pip/basecommand.py", line 122, in main
    status = self.run(options, args)
  File "/usr/lib/python2.7/dist-packages/pip/commands/install.py", line 278, in run
    requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle)
  File "/usr/lib/python2.7/dist-packages/pip/req.py", line 1259, in prepare_files
    )[0]
IndexError: list index out of range
Storing debug log for failure in /root/.pip/pip.log
These all seem like bugs caused by the old version of distribute bundled with this package. I don't see a reason to ever ship distribute inside your package, where it overwrites the user's own copy, but perhaps I'm missing something.
Just to let you know that 9671684 broke compatibility with Scrapy versions lower than 0.18. Using scrapy-webdriver on that changeset with Scrapy 0.16.5 throws the error:
ImportError: Error loading object 'scrapy.core.downloader.handlers.http10.HTTP10DownloadHandler': No module named http10
I have reverted to using 079ab6a which works correctly.
You should probably specify in the README which Scrapy version scrapy-webdriver supports, although a fix for this should be straightforward.
Also, consider doing a version bump in these cases; uploading the package to PyPI would also make things easier.
Thanks for your work!
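A straightforward compatibility gate could pick the download handler path from the installed Scrapy version. The dotted paths and the 0.18 cutoff come from the errors above; the helper names are made up, so treat this as a sketch rather than the project's fix:

```python
def parse_version(v):
    """Turn a dotted version string like '0.16.5' into a comparable
    tuple of ints, e.g. (0, 16, 5)."""
    return tuple(int(x) for x in v.split('.'))

def pick_handler_path(scrapy_version):
    """Return the dotted path of the download handler to load.
    The http10 module only exists in Scrapy >= 0.18, as the
    ImportError under 0.16.5 shows."""
    if parse_version(scrapy_version) >= (0, 18):
        return 'scrapy.core.downloader.handlers.http10.HTTP10DownloadHandler'
    return 'scrapy.core.downloader.handlers.http.HttpDownloadHandler'
```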
I saw the request is replaced with dont_filter=True; if I remove that, the spider just stops when it reaches the same url.
I need to use the offsite middleware though, so any thoughts?
I will do some hacking on a total rewrite that needs no spider middleware, only a downloader middleware or a normal downloader. Starting to understand this stuff a little, hehe.
I'm currently seeing that it's stuck on downloading for a long time; could it be that the request timed out, so it won't continue? Are requests currently not concurrent because of the queues? Does it only take them out of the queue one by one?
2013-05-14 13:46:23+0800 [scrapy] DEBUG: Downloading http://xxxxl.com/item.html with webdriver
2013-05-14 13:46:32+0800 [xxx] INFO: Crawled 23 pages (at 23 pages/min), scraped 9 items (at 9 items/min)
2013-05-14 13:47:32+0800 [xxx] INFO: Crawled 23 pages (at 0 pages/min), scraped 9 items (at 0 items/min)
2013-05-14 13:48:32+0800 [xxx] INFO: Crawled 23 pages (at 0 pages/min), scraped 9 items (at 0 items/min)
2013-05-14 13:49:32+0800 [xxx] INFO: Crawled 23 pages (at 0 pages/min), scraped 9 items (at 0 items/min)
2013-05-14 13:50:32+0800 [xx] INFO: Crawled 23 pages (at 0 pages/min), scraped 9 items (at 0 items/min)
Feature description:
Add the ability to spawn multiple webdrivers so we can process requests concurrently.
For this we need an extra option, a maximum number of webdrivers, as the pool shouldn't grow indefinitely.
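A bounded-pool sketch of that option (the class and parameter names are made up, not the library's API):

```python
import queue

class WebdriverPool(object):
    """Bounded pool of webdriver instances (hypothetical sketch).

    `factory` creates a new driver on demand; at most `max_drivers`
    are ever created, so the pool cannot grow indefinitely.
    """

    def __init__(self, factory, max_drivers=4):
        self.factory = factory
        self.max_drivers = max_drivers
        self.created = 0
        self.idle = queue.Queue()

    def acquire(self):
        # Reuse an idle driver if one exists; otherwise create a new
        # one until the cap is hit, then block until one is released.
        try:
            return self.idle.get_nowait()
        except queue.Empty:
            if self.created < self.max_drivers:
                self.created += 1
                return self.factory()
            return self.idle.get()

    def release(self, driver):
        self.idle.put(driver)
```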
It probably got stuck on downloading because PhantomJS crashed:
[DEBUG - 2013-05-18T04:28:00.536Z] Session [399aee20-bf06-11e2-a1b3-1ff9fbb8ef48] - _execFuncAndWaitForLoadDecorator - Page Loading in Session: true
[DEBUG - 2013-05-18T04:28:00.637Z] Session [399aee20-bf06-11e2-a1b3-1ff9fbb8ef48] - _execFuncAndWaitForLoadDecorator - Page Loading in Session: true
ExceptionHandler::GenerateDump waitpid failed:No child processes
PhantomJS has crashed. Please read the crash reporting guide at https://github.com/ariya/phantomjs/wiki/Crash-Reporting and file a bug report at https://github.com/ariya/phantomjs/issues/new with the crash dump file attached: /tmp/75f0d88c-1f16-3dd6-4a2892d0-687e48d0.dmp
So we may also need a way to check whether PhantomJS is still responding, and if not, automatically restart the webdriver/PhantomJS process.
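A cheap liveness probe could look like this (a sketch; any WebDriver command raises once the browser process has died, so reading current_url doubles as a health check, and the restart policy itself is left out):

```python
def driver_is_alive(driver):
    """Best-effort check that the webdriver's browser process is
    still responding. Reading `current_url` makes a round-trip to
    the browser; a crashed PhantomJS raises instead of answering.
    The caller is expected to quit and respawn the driver on False."""
    try:
        driver.current_url  # cheap round-trip command
        return True
    except Exception:
        return False
```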