Comments (8)

wRAR commented on June 10, 2024

Do you have a disk queue corrupted by e.g. a prior spider crash?

Otherwise you need to make a minimal reproducible example of this problem.

from scrapy.

wRAR commented on June 10, 2024

shouldn't we check if the key exists first?

No, the key should always exist there because of how the class code is written.
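The invariant wRAR describes can be illustrated with a minimal sketch (hypothetical, not Scrapy's actual priority-queue code): a key is created on push and deleted only when its queue drains on pop, so a missing key during lookup signals a broken invariant rather than a normal condition that a `.get()` check should paper over.

```python
class PriorityQueues:
    """Sketch of the 'key always exists' invariant (hypothetical class,
    not Scrapy's real implementation)."""

    def __init__(self):
        self.queues = {}  # priority -> list of pending requests

    def push(self, request, priority):
        # Create the per-priority queue lazily; afterwards the key exists.
        self.queues.setdefault(priority, []).append(request)

    def pop(self, priority):
        queue = self.queues[priority]  # a KeyError here means the invariant broke
        request = queue.pop(0)
        if not queue:
            del self.queues[priority]  # drop the key once the queue drains
        return request
```

Guarding the lookup with a key check would silently mask the bug instead of surfacing it as a KeyError.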

roykyle8 commented on June 10, 2024

@wRAR I clear the request cache directory before starting a new run, so starting a fresh crawl with a corrupted queue should not be possible.

Regarding the reproducible example: the issue does not occur every time or for every crawl, so reproducing it is very difficult.

Gallaecio commented on June 10, 2024

I clear the request cache directory before starting new run.

HTTPCACHE_DIR? JOBDIR? Something else?

roykyle8 commented on June 10, 2024

@Gallaecio I clear both HTTPCACHE_DIR and JOBDIR before starting a new run. I have a custom script that starts multiple spiders together (custom_crawler.py):

# custom_crawler.py:
from scrapy.crawler import CrawlerProcess


def main(cmd_args):
    crawler_process = CrawlerProcess(None, install_root_handler=False)
    for sp in cmd_args.spiders:
        spider_instance = __load_spider_instance__(sp)
        http_cache_dir = f'/opt/http_cache/{spider_instance.name}'
        job_dir = f'/opt/job_dir/{spider_instance.name}'
        __delete_and_recreate_dir__(http_cache_dir)
        __delete_and_recreate_dir__(job_dir)

        custom_settings = {'HTTPCACHE_DIR': http_cache_dir, 'JOBDIR': job_dir}
        # SCHEDULER_DISK_QUEUE: scrapy.squeues.PickleFifoDiskQueue is set in the spider class
        # SCHEDULER_MEMORY_QUEUE: scrapy.squeues.FifoMemoryQueue is set in the spider class
        spider_instance.custom_settings.update(custom_settings)
        # CrawlerProcess.crawl() accepts a spider class (or a Crawler object)
        crawler_process.crawl(type(spider_instance))

    crawler_process.start()

The spiders run fine, but I get a KeyError for some of them, and the error does not occur on every run.

roykyle8 commented on June 10, 2024

@wRAR / @Gallaecio can you please guide me on how to find and fix this issue?

wRAR commented on June 10, 2024

No direct ideas, but I would start by logging additions to and removals from self.queues, to see whether the code's expectation (that the key always exists there when it is looked up) is actually being broken.
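One low-effort way to add that logging, assuming the container behind self.queues is a plain dict (a generic sketch, not a Scrapy-specific API): wrap it in a dict subclass that logs every key addition, removal, and failed lookup, then assign an instance of it in place of the original dict.

```python
import logging

logger = logging.getLogger(__name__)


class LoggingDict(dict):
    """dict that logs key additions and removals, to verify the
    'key always exists when looked up' expectation."""

    def __setitem__(self, key, value):
        if key not in self:
            logger.info("queues: adding key %r", key)
        super().__setitem__(key, value)

    def __delitem__(self, key):
        logger.info("queues: removing key %r", key)
        super().__delitem__(key)

    def __missing__(self, key):
        # Called by dict.__getitem__ when the key is absent: log the
        # broken expectation, then fail exactly like a normal dict.
        logger.error("queues: lookup of missing key %r", key)
        raise KeyError(key)
```

Correlating the "removing key" and "lookup of missing key" log lines should show which code path deleted the key before the failing lookup.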

roykyle8 commented on June 10, 2024

@wRAR @Gallaecio I am still looking for a solution and trying to debug the KeyError, but I have started seeing another issue:

2024-03-27 03:30:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 03:31:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 03:32:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
[... the same line repeats once per minute ...]
2024-03-27 04:19:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)

And the spider never completes; I have to kill the process and rerun it.
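Until the root cause is found, one workaround is to detect the stall from outside the process. A minimal sketch, under the assumption that a stall always shows up as a long run of consecutive "0 pages/min" logstats lines like the ones above (a hypothetical helper, not part of Scrapy):

```python
import re

# logstats emits one progress line per minute; a stalled crawl reports
# "(at 0 pages/min)" on every one of them.
STALL_RE = re.compile(r"Crawled \d+ pages \(at 0 pages/min\)")


def is_stalled(log_lines, threshold=30):
    """Return True if the last `threshold` logstats lines all report
    0 pages/min, i.e. the crawl made no progress for that many minutes."""
    recent = [ln for ln in log_lines
              if "[scrapy.extensions.logstats]" in ln][-threshold:]
    return len(recent) == threshold and all(STALL_RE.search(ln) for ln in recent)
```

A supervisor or cron job could tail the spider's log, call `is_stalled` on the recent lines, and kill and restart the process when it returns True.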
