
Comments (7)

my8100 commented on June 23, 2024

@waldner
v0.9.6 updates the caching mechanism: a finished job will be cached only once.

waldner commented on June 23, 2024

Commenting out that line makes things much better. Thanks!

waldner commented on June 23, 2024

Right now I don't have any suggestions; if I stumble across another issue, I'll report it then. Thanks!

my8100 commented on June 23, 2024

By default, ScrapydWeb caches the utf8 and stats files in the background periodically to speed up the loading of the utf8 and stats HTML pages.
If a full round of caching (i.e. fetching the logs of all running and finished jobs on all Scrapyd servers and generating the corresponding utf8 and stats HTML files) finishes within 300 seconds, ScrapydWeb waits until the 300 seconds are up before starting a new round; if the last round took more than 300 seconds, the new round starts immediately.
Please check your CACHE_INTERVAL_SECONDS setting; you may increase the interval, or even set DISABLE_CACHE to True, if there are too many logs after your Scrapyd server starts up:

DISABLE_CACHE = False
CACHE_INTERVAL_SECONDS = 300
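
In other words, rounds are spaced at least CACHE_INTERVAL_SECONDS apart. A minimal sketch of that loop (run_one_round is an illustrative callback, not ScrapydWeb's actual code):

import time

CACHE_INTERVAL_SECONDS = 300

def caching_loop(run_one_round):
    """Run caching rounds at least CACHE_INTERVAL_SECONDS apart.

    If a round finishes early, sleep out the rest of the interval;
    if it overruns, start the next round immediately.
    """
    while True:
        started = time.monotonic()
        run_one_round()  # fetch all logs, regenerate utf8/stats HTML
        elapsed = time.monotonic() - started
        if elapsed < CACHE_INTERVAL_SECONDS:
            time.sleep(CACHE_INTERVAL_SECONDS - elapsed)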

It tries to locate the Scrapy log file in the order below: if no JOBID.log is found, it tries JOBID.log.gz, and so on. You may adjust the order for your own setup.

SCRAPYD_LOG_EXTENSIONS = ['.log', '.log.gz', '.gz', '.txt', '']
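
As a hypothetical illustration of that lookup order (find_log_file and the directory layout below are assumptions, not ScrapydWeb internals), the probing is just a first-match loop:

import os

SCRAPYD_LOG_EXTENSIONS = ['.log', '.log.gz', '.gz', '.txt', '']

def find_log_file(logs_dir, project, spider, job_id):
    """Return the first existing log file, probing extensions in order."""
    for ext in SCRAPYD_LOG_EXTENSIONS:
        path = os.path.join(logs_dir, project, spider, job_id + ext)
        if os.path.exists(path):
            return path
    return None  # no log found under any known extension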

Can you show me some key logs of ScrapydWeb?

BTW, how do you run two Scrapyd instances on one machine? With Docker?

waldner commented on June 23, 2024

By default, ScrapydWeb caches the utf8 and stats files in the background periodically to speed up the loading of the utf8 and stats HTML pages.

I understand that ScrapydWeb wants to cache files, but once a log file has been fetched from the source and Scrapyd has marked the job as finished, it will never change again, so there is no need to waste CPU and bandwidth constantly refetching it (what if someone has 20 or 100 Scrapyd instances, each with multiple projects and spiders?).
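
Something like this cache-once guard would avoid the waste (just a sketch of the idea, not actual ScrapydWeb code):

cached_finished_jobs = set()  # job ids whose logs were already cached

def should_fetch(job_id, is_finished):
    """Fetch running jobs every round; fetch each finished job only once."""
    if not is_finished:
        return True
    if job_id in cached_finished_jobs:
        return False
    cached_finished_jobs.add(job_id)
    return True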

Please check your CACHE_INTERVAL_SECONDS setting.

My CACHE_INTERVAL_SECONDS is set to 300 as well. Does this mean that ScrapydWeb will try to refresh its cache every 300 seconds? If so, it's not working correctly, since it fetches the logs much more often than every 300 seconds.

How do you run two Scrapyd instances on one machine? With Docker?

Yes, I'm using Docker to run both Scrapyd and ScrapydWeb.

Can you show me some key logs of ScrapydWeb?

ScrapydWeb's logs just show a continuous stream of POSTs to 127.0.0.1 (to update its caches with the logs fetched from the Scrapyd instances, I suppose):

127.0.0.1 - - [14/Oct/2018 17:35:39] "POST /2/log/stats/project2/spider1/spider1_2018-10-10_16-56-55/ HTTP/1.1" 200 - 
127.0.0.1 - - [14/Oct/2018 17:35:42] "POST /2/log/utf8/project2/spider2/spider2_2018-10-10_17-02-13/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:04] "POST /2/log/stats/project2/spider2/spider2_2018-10-10_17-02-13/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:05] "POST /1/log/utf8/project1/spider3/spider3_2018-10-11_03-02-01/ HTTP/1.1" 200 - 
127.0.0.1 - - [14/Oct/2018 17:36:05] "POST /2/log/stats/project1/spider3/spider3_2018-10-11_03-02-01/ HTTP/1.1" 200 - 
127.0.0.1 - - [14/Oct/2018 17:36:06] "POST /1/log/utf8/project1/spider4/spider4_2018-10-11_03-22-02/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:07] "POST /2/log/stats/project1/spider4/spider4_2018-10-11_03-22-02/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:08] "POST /1/log/utf8/project1/spider1/spider1_2018-10-11_03-42-02/ HTTP/1.1" 200 - 
127.0.0.1 - - [14/Oct/2018 17:36:14] "POST /2/log/stats/project1/spider1/spider1_2018-10-11_03-42-02/ HTTP/1.1" 200 - 
127.0.0.1 - - [14/Oct/2018 17:36:14] "POST /1/log/utf8/project1/spider5/spider5_2018-10-11_04-02-02/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:15] "POST /2/log/stats/project1/spider5/spider5_2018-10-11_04-02-02/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:15] "POST /1/log/utf8/project1/spider6/spider6_2018-10-11_04-22-02/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:15] "POST /2/log/stats/project1/spider6/spider6_2018-10-11_04-22-02/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:15] "POST /1/log/utf8/project1/spider7/spider7_2018-10-11_04-42-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:15] "POST /2/log/stats/project1/spider7/spider7_2018-10-11_04-42-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:16] "POST /1/log/utf8/project1/spider8/spider8_2018-10-11_05-02-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:18] "POST /2/log/stats/project1/spider8/spider8_2018-10-11_05-02-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:18] "POST /1/log/utf8/project1/spider9/spider9_2018-10-11_05-42-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:21] "POST /2/log/stats/project1/spider9/spider9_2018-10-11_05-42-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:21] "POST /1/log/utf8/project1/spider10/spider10_2018-10-11_06-22-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:21] "POST /2/log/stats/project1/spider10/spider10_2018-10-11_06-22-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:22] "POST /1/log/utf8/project1/spider11/spider11_2018-10-11_06-02-01/ HTTP/1.1" 200 -
...

my8100 commented on June 23, 2024

I have updated my first comment and will improve the caching mechanism in the next version.
Thank you for your advice!

For now, you may also comment out the following line to disable caching the Scrapy logs of finished jobs:
https://github.com/my8100/scrapydweb/blob/master/scrapydweb/cache.py#L52
update_cache('finished')
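
Commented out, that single line would read as follows (everything around it in cache.py stays unchanged):

# update_cache('finished')  # skip re-caching logs of finished jobs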

my8100 commented on June 23, 2024

@waldner
Do you have any other suggestions for ScrapydWeb, e.g. about the Overview page or the Run page, or any other needs?
Thank you in advance.
