scrapy-webdriver's People

Contributors

ncadou, samos123, titanjer


scrapy-webdriver's Issues

WebdriverRequest can't callback parse_item

# Imports added for context (module paths match the Scrapy releases of that era
# and the scrapy-webdriver README; adjust to your installed version).
from scrapy import log
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy_webdriver.http import WebdriverRequest


class MySpider(CrawlSpider):
    start_urls = [
        "http://www.example.com",
    ]
    rules = (
        Rule(
            LxmlLinkExtractor(
                allow=[r'\w+/\d+$', r'\w+/\d+-p\d+$'],
            ),
            follow=True,
        ),
        Rule(
            LxmlLinkExtractor(
                allow=(r'\d+.html$'),
            ),
            'parse_action',
        ),
    )

    def parse_action(self, response):
        yield WebdriverRequest(response.url,
                               callback=self.parse_item)

    def parse_item(self, response):
        self.log('received for %s' % response.url, level=log.WARNING)
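
One possible workaround, sketched below and untested against this issue: instead of issuing a second request from parse_action, convert the requests that the second Rule builds into WebdriverRequests through Rule's process_request hook. This assumes WebdriverRequest accepts the standard Request keyword arguments.

    from scrapy_webdriver.http import WebdriverRequest

    def to_webdriver_request(request):
        # Request.replace(cls=...) rebuilds the request as a WebdriverRequest
        # while keeping the url, callback and meta CrawlSpider already set.
        return request.replace(cls=WebdriverRequest)

    # Drop-in replacement for the second entry in MySpider.rules:
    Rule(
        LxmlLinkExtractor(allow=(r'\d+.html$',)),
        callback='parse_item',
        process_request=to_webdriver_request,
    )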

Installation fails in windows 64bit conda environment

I get the following stacktrace when I run the pip install given in the README.md.

>pip install https://github.com/sosign/scrapy-webdriver/archive/master.zip
Collecting https://github.com/sosign/scrapy-webdriver/archive/master.zip
  Downloading https://github.com/sosign/scrapy-webdriver/archive/master.zip
     \ 20kB 2.6MB/s
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "C:\Users\Tobias\AppData\Local\Temp\pip-x5s6n50e-build\setup.py", lin
e 8, in <module>
        from scrapy_webdriver import metadata
      File "C:\Users\Tobias\AppData\Local\Temp\pip-x5s6n50e-build\scrapy_webdriv
er\__init__.py", line 3, in <module>
        import metadata
    ModuleNotFoundError: No module named 'metadata'

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in C:\Users\Tobias\A
ppData\Local\Temp\pip-x5s6n50e-build\

My environment is below:
conda list
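
The traceback points at a bare "import metadata" inside scrapy_webdriver/__init__.py; under Python 3 (which a recent 64-bit conda environment typically provides) implicit relative imports are gone, so that line fails. A minimal sketch of the likely fix, assuming the layout shown in the traceback:

    # scrapy_webdriver/__init__.py (sketch of the likely fix).
    # The bare "import metadata" only works under Python 2's implicit relative
    # imports; an explicit relative import works on both Python 2.6+ and Python 3.
    from __future__ import absolute_import

    from . import metadata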

Error: unbound method get() must be called with WebDriver instance as first argument

2013-05-11 10:55:01+0800 [xxx] ERROR: Error downloading <GET http://xxxxx.com/item.htm?id=151215152>: unbound method get() must be called with WebDriver instance as first argument (got str instance instead)

I'm not sure why this is happening yet, any clues? Incorrect usage?

I'm using it as follows:

    def parse(self, response):
        """ The parse method will get called for any response which doesn't
        have a callback set. Parse is the default callback.

        So this is where we check for items and or which urls to follow etc.
        """
        hxs = HtmlXPathSelector(response)
        for url in hxs.select('//a[starts-with(@href, "http")]'
                              '/@href').extract():
            if re.search(r'item.html?\?id=\d+', url):
                yield WebdriverRequest(url, callback=self.parse_item)
            else:
                yield Request(url, callback=self.parse)

It seems the webdriver isn't initialized; I'll dig into the code a bit. Are you still using / maintaining this?
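
The "(got str instance instead)" part suggests get() is being called on the browser name (a string) rather than on an instantiated WebDriver, i.e. the driver was never created. For comparison, here is a minimal settings sketch in the spirit of the project README; the setting names, module paths and priority below are recalled from memory and should be treated as assumptions:

    # settings.py (sketch) -- wiring scrapy-webdriver into a project.
    WEBDRIVER_BROWSER = 'PhantomJS'  # assumption: a browser name or a WebDriver class

    DOWNLOAD_HANDLERS = {
        'http': 'scrapy_webdriver.download.WebdriverDownloadHandler',
        'https': 'scrapy_webdriver.download.WebdriverDownloadHandler',
    }

    SPIDER_MIDDLEWARES = {
        'scrapy_webdriver.middlewares.WebdriverSpiderMiddleware': 543,
    }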

Spider closes on exception

If an exception is raised in a parse method handling a WebdriverResponse/WebdriverRequest, the whole spider closes/exits and doesn't continue.

Steps to reproduce:
In any of your parse methods that handle a WebdriverResponse, raise an exception.

Current result:
Scrapy stops crawling

Expected result:
Scrapy continues crawling next requests / urls

When parsing a normal Scrapy Request/Response and raising an error, the crawl seems to just continue. I only did some quick testing on this, so I may be wrong. This is a related error log:

2013-05-18 00:10:43+0800 [xxxxxx] ERROR: Spider error processing <GET http://item.xxxxx.com/>
        Traceback (most recent call last):
          File "/usr/lib/python2.7/site-packages/twisted/internet/base.py", line 824, in runUntilCurrent
            call.func(*call.args, **call.kw)
          File "/usr/lib/python2.7/site-packages/twisted/internet/task.py", line 607, in _tick
            taskObj._oneWorkUnit()
          File "/usr/lib/python2.7/site-packages/twisted/internet/task.py", line 484, in _oneWorkUnit
            result = next(self._iterator)
          File "/home/samos/.virtualenvs/scrapy/lib/python2.7/site-packages/scrapy/utils/defer.py", line 57, in <genexpr>
            work = (callable(elem, *args, **named) for elem in iterable)
        --- <exception caught here> ---
          File "/home/samos/.virtualenvs/scrapy/lib/python2.7/site-packages/scrapy/utils/defer.py", line 96, in iter_errback
            yield it.next()
          File "/home/samos/.virtualenvs/scrapy/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/offsite.py", line $
8, in process_spider_output
            for x in result:
          File "/home/samos/.virtualenvs/scrapy/lib/python2.7/site-packages/scrapy_webdriver/middlewares.py", line 36, in proc$
ss_spider_output
            for item_or_request in self._process_requests(result):
          File "/home/samos/.virtualenvs/scrapy/lib/python2.7/site-packages/scrapy_webdriver/middlewares.py", line 51, in _pro$
ess_requests
            for request in iter(items_or_requests):
          File "/home/samos/.virtualenvs/scrapy/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/referer.py", line $
2, in <genexpr>
            return (_set_referer(r) for r in result or ())
          File "/home/samos/.virtualenvs/scrapy/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/urllength.py", lin$
 33, in <genexpr>
            return (r for r in result or () if _filter(r))
          File "/home/samos/.virtualenvs/scrapy/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/depth.py", line 50$
 in <genexpr>
            return (r for r in result or () if _filter(r))
          File "/home/samos/workspace/alex-scrapy/crawler/spiders/xxx_spider.py", line 50, in parse_item
            raise Exception("test")
        exceptions.Exception: test
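
As a stop-gap while the root cause is investigated, the spider itself can catch parsing errors so a single bad WebdriverResponse doesn't take the whole crawl down. A rough sketch (extract_item is a hypothetical helper standing in for the real parsing logic):

    def parse_item(self, response):
        try:
            item = self.extract_item(response)  # hypothetical helper with the real parsing
        except Exception:
            # Log and skip this response instead of letting the exception
            # propagate up through the webdriver spider middleware.
            self.log('parse_item failed for %s' % response.url, level=log.ERROR)
            return
        yield item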

distribute installation breaks pip, fabric, etc.

Why are you including distribute 0.6.27 with this package? It makes no sense and is breaking pip and fabric after installing this package. For example:

$ fab
Traceback (most recent call last):
  File "/usr/local/bin/fab", line 5, in <module>
    from pkg_resources import load_entry_point
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 2735, in <module>
    return mkstemp(*args,**kw)
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 690, in require

  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 588, in resolve
    Unless `replace_conflicting=True`, raises a VersionConflict exception if
pkg_resources.DistributionNotFound: paramiko>=1.10
$ pip install paramiko
Downloading/unpacking paramiko
  Downloading paramiko-1.15.1-py2.py3-none-any.whl (165kB): 165kB downloaded
Cleaning up...
Exception:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/pip/basecommand.py", line 122, in main
    status = self.run(options, args)
  File "/usr/lib/python2.7/dist-packages/pip/commands/install.py", line 278, in run
    requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle)
  File "/usr/lib/python2.7/dist-packages/pip/req.py", line 1259, in prepare_files
    )[0]
IndexError: list index out of range

Storing debug log for failure in /root/.pip/pip.log

These all seem like bugs caused by the old version of distribute bundled with this package. I don't see a reason to ever ship distribute with your package, since it overwrites the user's own installation, but perhaps I'm missing something.
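
For what it's worth, the packaging could simply declare its runtime dependencies and leave setuptools/distribute to the user's environment. A minimal sketch (the dependency list and version are placeholders, not the project's actual metadata):

    # setup.py (sketch) -- nothing here installs or bundles distribute, so the
    # user's existing setuptools/distribute stays untouched.
    from setuptools import setup, find_packages

    setup(
        name='scrapy-webdriver',
        version='0.1',                # placeholder
        packages=find_packages(),
        install_requires=[
            'scrapy',                 # placeholder pins; adjust to real requirements
            'selenium',
        ],
    )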

Broken compatibility with Scrapy <0.18

Just to let you know that 9671684 broke compatibility with Scrapy versions lower than 0.18. Using scrapy-webdriver on that changeset with Scrapy 0.16.5 throws the error:

ImportError: Error loading object 'scrapy.core.downloader.handlers.http10.HTTP10DownloadHandler': No module named http10

I have reverted to using 079ab6a which works correctly.

You should probably specify in the README which Scrapy version scrapy-webdriver supports, although a fix for this should be straightforward.
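
A sketch of what such a fix could look like: fall back to the pre-0.18 handler when the 0.18+ module is missing (the pre-0.18 module and class names are recalled from memory and should be double-checked):

    # Version-tolerant import of the plain-HTTP fallback download handler.
    try:
        # Scrapy >= 0.18 splits the HTTP handler into http10/http11 modules.
        from scrapy.core.downloader.handlers.http10 import HTTP10DownloadHandler as FallbackHandler
    except ImportError:
        # Older Scrapy exposes a single handler module (name assumed).
        from scrapy.core.downloader.handlers.http import HttpDownloadHandler as FallbackHandler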

Also, consider doing a version bump in cases like this; uploading the package to PyPI would also make things easier.

Thanks for your work!

OffsiteMiddleware not working

I saw that the request is replaced with dont_filter=True; if I remove that, the spider just stops when it reaches the same URL again.

I need to use the offsite middleware though, so any thoughts?

I will do some hacking on a total rewrite that removes the need for the spider middleware, using only a downloader middleware or a normal download handler. I'm starting to understand this stuff a little.
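
A possible stop-gap, sketched below: a small spider middleware that enforces allowed_domains even for requests carrying dont_filter=True (which, as far as I can tell, the stock OffsiteMiddleware lets through). The class and the domain check are illustrative, not part of scrapy-webdriver:

    from urlparse import urlparse

    from scrapy.http import Request

    class StrictOffsiteMiddleware(object):
        def process_spider_output(self, response, result, spider):
            allowed = tuple(getattr(spider, 'allowed_domains', None) or ())
            for request_or_item in result:
                if allowed and isinstance(request_or_item, Request):
                    host = urlparse(request_or_item.url).hostname or ''
                    onsite = any(host == d or host.endswith('.' + d) for d in allowed)
                    if not onsite:
                        continue  # drop the offsite request instead of yielding it
                yield request_or_item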

Stuck on Downloading for a long time

I'm currently seeing that it's stuck on downloading for a long time. Could it be that the request timed out, so it won't continue? Are requests currently not concurrent because of the queues? Does it only take requests out of the queue one by one?

2013-05-14 13:46:23+0800 [scrapy] DEBUG: Downloading http://xxxxl.com/item.html with webdriver
2013-05-14 13:46:32+0800 [xxx] INFO: Crawled 23 pages (at 23 pages/min), scraped 9 items (at 9 items/min)
2013-05-14 13:47:32+0800 [xxx] INFO: Crawled 23 pages (at 0 pages/min), scraped 9 items (at 0 items/min)
2013-05-14 13:48:32+0800 [xxx] INFO: Crawled 23 pages (at 0 pages/min), scraped 9 items (at 0 items/min)
2013-05-14 13:49:32+0800 [xxx] INFO: Crawled 23 pages (at 0 pages/min), scraped 9 items (at 0 items/min)
2013-05-14 13:50:32+0800 [xx] INFO: Crawled 23 pages (at 0 pages/min), scraped 9 items (at 0 items/min)  

Feature description:
Add the ability to spawn multiple webdrivers so we can scrape requests concurrently.

For this we need an extra option, a maximum number of webdrivers, as the pool shouldn't grow indefinitely; a rough pool sketch follows below.
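
A rough sketch of such a bounded pool, assuming PhantomJS via selenium; the class name and the WEBDRIVER_POOL_SIZE idea are hypothetical, not existing scrapy-webdriver options:

    from Queue import Queue

    from selenium import webdriver

    class WebdriverPool(object):
        """Bounded pool of PhantomJS instances; callers block when all are busy."""

        def __init__(self, size=4):  # size would come from a WEBDRIVER_POOL_SIZE setting
            self._pool = Queue()
            for _ in range(size):
                self._pool.put(webdriver.PhantomJS())

        def acquire(self):
            return self._pool.get()  # blocks until a driver is free

        def release(self, driver):
            self._pool.put(driver)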

The reason it got stuck on downloading is probably that PhantomJS crashed:

[DEBUG - 2013-05-18T04:28:00.536Z] Session [399aee20-bf06-11e2-a1b3-1ff9fbb8ef48] - _execFuncAndWaitForLoadDecorator - Page Loading in Session: true
[DEBUG - 2013-05-18T04:28:00.637Z] Session [399aee20-bf06-11e2-a1b3-1ff9fbb8ef48] - _execFuncAndWaitForLoadDecorator - Page Loading in Session: true
ExceptionHandler::GenerateDump waitpid failed:No child processes
PhantomJS has crashed. Please read the crash reporting guide at https://github.com/ariya/phantomjs/wiki/Crash-Reporting and file a bug report at https://github.com/ariya/phantomjs/issues/new with the crash dump file attached: /tmp/75f0d88c-1f16-3dd6-4a2892d0-687e48d0.dmp

So we may also need a way to check whether PhantomJS is still responding and, if not, automatically restart the webdriver/PhantomJS process.
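
A sketch of what that liveness check could look like with plain selenium calls; how it would hook into scrapy-webdriver's download queue is left open:

    from selenium import webdriver
    from selenium.common.exceptions import WebDriverException

    def ensure_alive(driver):
        """Return a responsive driver, replacing it if PhantomJS has crashed."""
        try:
            driver.execute_script('return 1')  # cheap liveness probe
            return driver
        except WebDriverException:
            try:
                driver.quit()  # clean up the dead process if possible
            except WebDriverException:
                pass
            return webdriver.PhantomJS()  # spawn a fresh instance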
