Comments (7)
@waldner
v0.9.6 updates the caching mechanism: a finished job will be cached only once.
Commenting out that line makes things much better. Thanks!
Right now I don't have any suggestions; if I happen to stumble across some other issue, I will report it then. Thanks!
By default, ScrapydWeb periodically caches the utf8 and stats files in the background to speed up the loading of the utf8 and stats html pages.
If a full round of caching (that is, fetching the logs of all running and finished jobs on all Scrapyd servers and generating the corresponding utf8 and stats html files) finishes within 300 s, ScrapydWeb waits until the 300 s mark before starting a new round. If the last round took more than 300 s, the new round starts immediately.
Please check the setting of CACHE_INTERVAL_SECONDS; you may increase the interval, or even set DISABLE_CACHE to True, if there are too many logs after your Scrapyd server starts up.
DISABLE_CACHE = False
CACHE_INTERVAL_SECONDS = 300
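For illustration, here is a minimal sketch of that schedule; the loop and the cache_one_round callable are assumptions made for this example, not ScrapydWeb's actual code.

import time

CACHE_INTERVAL_SECONDS = 300

def run_cache_loop(cache_one_round):
    # cache_one_round: hypothetical callable that fetches the logs of all
    # running/finished jobs and generates the utf8 and stats html files.
    while True:
        start = time.time()
        cache_one_round()
        elapsed = time.time() - start
        if elapsed < CACHE_INTERVAL_SECONDS:
            # The round finished early: wait until the 300 s mark.
            time.sleep(CACHE_INTERVAL_SECONDS - elapsed)
        # Otherwise the last round already took longer than the interval,
        # so the next round starts immediately.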
And it tries to locate the Scrapy log in the order below: if no JOBID.log is found, it tries JOBID.log.gz, and so on; you may adjust the order for your own case.
SCRAPYD_LOG_EXTENSIONS = ['.log', '.log.gz', '.gz', '.txt', '']
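As an illustration of that lookup order, here is a minimal sketch; the /logs/<project>/<spider>/ URL layout is Scrapyd's default, but the find_log_url helper is an assumption for this example, not ScrapydWeb's code.

import requests

SCRAPYD_LOG_EXTENSIONS = ['.log', '.log.gz', '.gz', '.txt', '']

def find_log_url(scrapyd_url, project, spider, job_id):
    # Return the first log URL that exists, trying each extension in order.
    for ext in SCRAPYD_LOG_EXTENSIONS:
        url = "%s/logs/%s/%s/%s%s" % (scrapyd_url, project, spider, job_id, ext)
        if requests.head(url, timeout=10).status_code == 200:
            return url
    return None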
Can you show me some key logs of ScrapydWeb?
BTW, how do you run two scrapyd instances on one machine, with docker?
By default, ScrapydWeb periodically caches the utf8 and stats files in the background to speed up the loading of the utf8 and stats html pages.
I understand that scrapydweb wants to cache files, but it seems to me that once a log file has been fetched from the source and scrapyd has marked the job as finished, it will not change anymore, so there's no need to waste CPU and bandwidth refetching it constantly (what if one has 20 or 100 scrapyd instances, each with multiple projects and spiders?).
please check the setting of CACHE_INTERVAL_SECONDS.
My CACHE_INTERVAL_SECONDS is set to 300 as well. Does that mean scrapydweb will try to refresh its cache every 300 seconds? If so, it's not working correctly, since it fetches the logs much more often than every 300 seconds.
How do you run two scrapyd instances on one machine, with docker?
Yes, I'm using docker to run both scrapyd and scrapydweb.
Can you show me some key logs of ScrapydWeb?
Scrapydweb's logs just show a continuous stream of POSTs to 127.0.0.1 (I suppose to update its caches with the logs fetched from the scrapyd instances):
127.0.0.1 - - [14/Oct/2018 17:35:39] "POST /2/log/stats/project2/spider1/spider1_2018-10-10_16-56-55/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:35:42] "POST /2/log/utf8/project2/spider2/spider2_2018-10-10_17-02-13/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:04] "POST /2/log/stats/project2/spider2/spider2_2018-10-10_17-02-13/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:05] "POST /1/log/utf8/project1/spider3/spider3_2018-10-11_03-02-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:05] "POST /2/log/stats/project1/spider3/spider3_2018-10-11_03-02-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:06] "POST /1/log/utf8/project1/spider4/spider4_2018-10-11_03-22-02/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:07] "POST /2/log/stats/project1/spider4/spider4_2018-10-11_03-22-02/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:08] "POST /1/log/utf8/project1/spider1/spider1_2018-10-11_03-42-02/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:14] "POST /2/log/stats/project1/spider1/spider1_2018-10-11_03-42-02/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:14] "POST /1/log/utf8/project1/spider5/spider5_2018-10-11_04-02-02/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:15] "POST /2/log/stats/project1/spider5/spider5_2018-10-11_04-02-02/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:15] "POST /1/log/utf8/project1/spider6/spider6_2018-10-11_04-22-02/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:15] "POST /2/log/stats/project1/spider6/spider6_2018-10-11_04-22-02/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:15] "POST /1/log/utf8/project1/spider7/spider7_2018-10-11_04-42-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:15] "POST /2/log/stats/project1/spider7/spider7_2018-10-11_04-42-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:16] "POST /1/log/utf8/project1/spider8/spider8_2018-10-11_05-02-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:18] "POST /2/log/stats/project1/spider8/spider8_2018-10-11_05-02-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:18] "POST /1/log/utf8/project1/spider9/spider9_2018-10-11_05-42-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:21] "POST /2/log/stats/project1/spider9/spider9_2018-10-11_05-42-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:21] "POST /1/log/utf8/project1/spider10/spider10_2018-10-11_06-22-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:21] "POST /2/log/stats/project1/spider10/spider10_2018-10-11_06-22-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:22] "POST /1/log/utf8/project1/spider11/spider11_2018-10-11_06-02-01/ HTTP/1.1" 200 -
...
I have updated my first comment; I will improve the caching mechanism in the next version.
Thank you for your advice!
At this point, you may also comment out the following line to disable caching the Scrapy logs of finished jobs:
https://github.com/my8100/scrapydweb/blob/master/scrapydweb/cache.py#L52
update_cache('finished')
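For reference, here is a minimal sketch of what "cached only once" from the v0.9.6 note could look like; the names below are hypothetical and this is not the actual scrapydweb/cache.py implementation.

cached_finished_jobs = set()

def update_cache_finished(finished_jobs, cache_job):
    # finished_jobs: e.g. (node, project, spider, job_id) tuples;
    # cache_job: callable that fetches the log and builds the utf8/stats html.
    for job in finished_jobs:
        if job in cached_finished_jobs:
            continue  # a finished job's log no longer changes, so skip the refetch
        cache_job(job)
        cached_finished_jobs.add(job)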
@waldner
Do you have any other advice for ScrapydWeb, e.g. about the Overview page or the Run page, or any other needs?
Thank you in advance.