Comments (7)
@waldner
v0.9.6 updates the caching mechanism: a finished job will be cached only once.
Commenting out that line makes things much better. Thanks!
Right now I don't have any suggestions; if I happen to stumble across some other issue, I will report it then. Thanks!
By default, ScrapydWeb periodically caches the utf8 and stats files in the background to speed up the loading of the utf8 and stats html pages.
If a full round of caching (that is, fetching the logs of all running and finished jobs on all Scrapyd servers and generating the corresponding utf8 and stats html files) finishes within 300 s, ScrapydWeb waits until the 300 s mark before starting a new round. If the last round took more than 300 s, the new round starts immediately.
Please check the setting of CACHE_INTERVAL_SECONDS; you may increase the interval, or even set DISABLE_CACHE to True, if there are too many logs after your Scrapyd server starts up.
DISABLE_CACHE = False
CACHE_INTERVAL_SECONDS = 300
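For illustration, here is a minimal sketch of that schedule; the loop and the cache_one_round callable are assumptions made for this example, not ScrapydWeb's actual code.

import time

CACHE_INTERVAL_SECONDS = 300

def run_cache_loop(cache_one_round):
    # cache_one_round: hypothetical callable that fetches the logs of all
    # running/finished jobs and generates the utf8 and stats html files.
    while True:
        start = time.time()
        cache_one_round()
        elapsed = time.time() - start
        if elapsed < CACHE_INTERVAL_SECONDS:
            # The round finished early: wait until the 300 s mark.
            time.sleep(CACHE_INTERVAL_SECONDS - elapsed)
        # Otherwise the last round already took longer than the interval,
        # so the next round starts immediately.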
And it tries to locate the Scrapy log in the order below: if no JOBID.log is found, it tries JOBID.log.gz, and so on; you may adjust the order for your own case.
SCRAPYD_LOG_EXTENSIONS = ['.log', '.log.gz', '.gz', '.txt', '']
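As an illustration of that lookup order, here is a minimal sketch; the /logs/<project>/<spider>/ URL layout is Scrapyd's default, but the find_log_url helper is an assumption for this example, not ScrapydWeb's code.

import requests

SCRAPYD_LOG_EXTENSIONS = ['.log', '.log.gz', '.gz', '.txt', '']

def find_log_url(scrapyd_url, project, spider, job_id):
    # Return the first log URL that exists, trying each extension in order.
    for ext in SCRAPYD_LOG_EXTENSIONS:
        url = "%s/logs/%s/%s/%s%s" % (scrapyd_url, project, spider, job_id, ext)
        if requests.head(url, timeout=10).status_code == 200:
            return url
    return None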
Can you show me some key logs of ScrapydWeb?
BTW, how do you run two scrapyd instances on one machine, with docker?
By default, ScrapydWeb periodically caches the utf8 and stats files in the background to speed up the loading of the utf8 and stats html pages.
I understand that scrapydweb wants to cache files, but it seems to me that once a log file has been fetched from the source and scrapyd has marked the job as finished, it will not change anymore, so there's no need to waste CPU and bandwidth refetching it constantly (what if one has 20 or 100 scrapyd instances, each with multiple projects and spiders?).
please check the setting of CACHE_INTERVAL_SECONDS.
My CACHE_INTERVAL_SECONDS is set to 300 as well. Does that mean scrapydweb will try to refresh its cache every 300 seconds? If so, it's not working correctly, since it fetches the logs much more often than every 300 seconds.
How do you run two scrapyd instances on one machine, with docker?
Yes, I'm using docker to run both scrapyd and scrapydweb.
Can you show me some key logs of ScrapydWeb?
Scrapydweb's logs just show a continuous stream of POSTs to 127.0.0.1 (I suppose to update its caches with the logs fetched from the scrapyd instances):
127.0.0.1 - - [14/Oct/2018 17:35:39] "POST /2/log/stats/project2/spider1/spider1_2018-10-10_16-56-55/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:35:42] "POST /2/log/utf8/project2/spider2/spider2_2018-10-10_17-02-13/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:04] "POST /2/log/stats/project2/spider2/spider2_2018-10-10_17-02-13/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:05] "POST /1/log/utf8/project1/spider3/spider3_2018-10-11_03-02-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:05] "POST /2/log/stats/project1/spider3/spider3_2018-10-11_03-02-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:06] "POST /1/log/utf8/project1/spider4/spider4_2018-10-11_03-22-02/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:07] "POST /2/log/stats/project1/spider4/spider4_2018-10-11_03-22-02/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:08] "POST /1/log/utf8/project1/spider1/spider1_2018-10-11_03-42-02/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:14] "POST /2/log/stats/project1/spider1/spider1_2018-10-11_03-42-02/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:14] "POST /1/log/utf8/project1/spider5/spider5_2018-10-11_04-02-02/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:15] "POST /2/log/stats/project1/spider5/spider5_2018-10-11_04-02-02/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:15] "POST /1/log/utf8/project1/spider6/spider6_2018-10-11_04-22-02/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:15] "POST /2/log/stats/project1/spider6/spider6_2018-10-11_04-22-02/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:15] "POST /1/log/utf8/project1/spider7/spider7_2018-10-11_04-42-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:15] "POST /2/log/stats/project1/spider7/spider7_2018-10-11_04-42-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:16] "POST /1/log/utf8/project1/spider8/spider8_2018-10-11_05-02-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:18] "POST /2/log/stats/project1/spider8/spider8_2018-10-11_05-02-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:18] "POST /1/log/utf8/project1/spider9/spider9_2018-10-11_05-42-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:21] "POST /2/log/stats/project1/spider9/spider9_2018-10-11_05-42-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:21] "POST /1/log/utf8/project1/spider10/spider10_2018-10-11_06-22-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:21] "POST /2/log/stats/project1/spider10/spider10_2018-10-11_06-22-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:22] "POST /1/log/utf8/project1/spider11/spider11_2018-10-11_06-02-01/ HTTP/1.1" 200 -
...
I have updated my first comment; I will improve the caching mechanism in the next version.
Thank you for your advice!
At this point, you may also comment out the following line to disable caching the Scrapy logs of finished jobs:
https://github.com/my8100/scrapydweb/blob/master/scrapydweb/cache.py#L52
update_cache('finished')
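For reference, here is a minimal sketch of what "cached only once" from the v0.9.6 note could look like; the names below are hypothetical and this is not the actual scrapydweb/cache.py implementation.

cached_finished_jobs = set()

def update_cache_finished(finished_jobs, cache_job):
    # finished_jobs: e.g. (node, project, spider, job_id) tuples;
    # cache_job: callable that fetches the log and builds the utf8/stats html.
    for job in finished_jobs:
        if job in cached_finished_jobs:
            continue  # a finished job's log no longer changes, so skip the refetch
        cache_job(job)
        cached_finished_jobs.add(job)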
@waldner
Do you have any other advice for ScrapydWeb, e.g. about the Overview page or the Run page, or any other needs?
Thank you in advance.