Hi, When I try to feed an url with this: curl localhost:5343/feed -H "Content-

maxdepth can not large than 2 about scrapy-cluster HOT 5 CLOSED

anthony9981 commented on July 26, 2024

maxdepth cannot be larger than 2

from scrapy-cluster.

Comments (5)

mrasoolmirzaei commented on July 26, 2024 1

By default, scrapy_cluster won't accept crawl requests with a maxdepth larger than 3. You have to change the schema first. To do this, log in to your kafka_monitor container:

1- docker exec -it container_id bash
2- cd plugins
3- edit scraper_schema.json (change the maximum value for maxdepth from 3 to whatever you need)

From this point on, you can submit crawls with any maxdepth between the schema's minimum value and the maximum value you just set.
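The edit in step 3 can also be scripted. Below is a minimal sketch, assuming scraper_schema.json is a JSON Schema document in which maxdepth lives under `properties` with `minimum` and `maximum` keys. The function names `raise_maxdepth_limit` and `update_schema_file` are hypothetical helpers, not part of scrapy-cluster.

```python
import json
from pathlib import Path


def raise_maxdepth_limit(schema: dict, new_max: int) -> dict:
    """Return a copy of the schema with maxdepth's upper bound set to new_max.

    Assumes the layout schema["properties"]["maxdepth"]["maximum"];
    check this against your own scraper_schema.json.
    """
    updated = json.loads(json.dumps(schema))  # deep copy of plain JSON data
    maxdepth = updated["properties"]["maxdepth"]
    minimum = maxdepth.get("minimum", 0)
    if new_max < minimum:
        raise ValueError(f"new maximum {new_max} is below the minimum {minimum}")
    maxdepth["maximum"] = new_max
    return updated


def update_schema_file(path: str, new_max: int) -> None:
    """Rewrite the schema file in place with the new maxdepth maximum."""
    schema_path = Path(path)
    schema = json.loads(schema_path.read_text())
    updated = raise_maxdepth_limit(schema, new_max)
    schema_path.write_text(json.dumps(updated, indent=4) + "\n")


# Example (run from inside the plugins directory):
# update_schema_file("scraper_schema.json", 5)
```

The Kafka Monitor will probably need a restart before it picks up the changed schema.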


madisonb commented on July 26, 2024 1

I'm happy to chat through custom implementations on Gitter, but per the guidelines I am going to close this issue as a "custom implementation" question which is beyond the scope of a true bug ticket/problem.

More generally - crawling at a depth beyond 2 gets your spider way into the weeds of the internet and is 99% of the time not useful for your actual request. If you wish to crawl at a greater depth you should also implement an allowed_domains filter or regex in the crawl api request to limit your crawler to a specific domain.

If you need to change anything else in the API spec for the request, you can do so in this file: https://github.com/istresearch/scrapy-cluster/blob/master/kafka-monitor/plugins/scraper_schema.json
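As a rough illustration of the advice about limiting a deeper crawl, a request restricted to one domain might be built like this. The field names (`url`, `appid`, `crawlid`, `maxdepth`, `allowed_domains`) are my reading of the crawl API and should be checked against scraper_schema.json. The helper `build_crawl_request` and all values are made up for the example.

```python
import json


def build_crawl_request(url, appid, crawlid, maxdepth, allowed_domains):
    """Build a crawl request dict that keeps the spider on the given domains.

    Hypothetical helper; the keys are assumed to match the crawl API schema.
    """
    return {
        "url": url,
        "appid": appid,
        "crawlid": crawlid,
        "maxdepth": maxdepth,
        # Restrict followed links to these domains so a deep crawl stays on-site.
        "allowed_domains": list(allowed_domains),
    }


request = build_crawl_request(
    url="http://example.com",
    appid="testapp",
    crawlid="abc123",
    maxdepth=4,  # only accepted once the schema maximum has been raised
    allowed_domains=["example.com"],
)
print(json.dumps(request))
```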


anthony9981 commented on July 26, 2024

Hi @NeoArio,
Thanks for your reply.
Your answer helps me a lot.
I have a question, if you don't mind:

I have a set of known websites, and for each one I need to crawl the title and content using an exact selector.
Q1: How can I predefine a CSS selector for each of them and then feed the monitor only the domain?
Q2: Where can I pick up the scraped items so I can store them in a database like Elasticsearch?
I tried pipelines (scrapy-elasticsearch), but that sends a ton of additional requests to the ES server.

Sorry, I'm new to Scrapy. This is awesome!
Best regards,


mrasoolmirzaei commented on July 26, 2024

Hi! I hope you enjoy scraping :D

Q1: I have another database that stores each website's CSS and XPath patterns. I don't know whether what you want to try is really applicable.
Q2: Scraped items are pushed to the demo.crawled_firehose topic: https://scrapy-cluster.readthedocs.io/en/latest/topics/kafka-monitor/api.html#kafka-topics
Write some code to consume from this topic, then do what you want with the data. Finally, you can send it to Elasticsearch through another Kafka pipeline. I think it is better to insert into Elasticsearch with bulk requests, so that each insertion contains, say, 100 crawled links. Add a timeout alongside this and you are set: flushing after 100 crawled links or 10 minutes, whichever comes first, is a good condition for inserting into Elasticsearch.
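The "100 links or 10 minutes" rule can be sketched as a small batching helper. This is only an illustration, not code from scrapy-cluster: `BulkBatcher` is a hypothetical class, and the Kafka and Elasticsearch wiring is left as comments because it depends on your client libraries and setup.

```python
import time
from typing import Callable, List, Optional


class BulkBatcher:
    """Collect items and flush them when the batch is full or too old."""

    def __init__(
        self,
        flush: Callable[[List[dict]], None],
        max_items: int = 100,
        max_age_seconds: float = 600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._flush = flush
        self._max_items = max_items
        self._max_age = max_age_seconds
        self._clock = clock
        self._items: List[dict] = []
        self._first_item_at: Optional[float] = None

    def add(self, item: dict) -> None:
        """Queue one crawled item and flush if a limit has been reached."""
        if not self._items:
            self._first_item_at = self._clock()
        self._items.append(item)
        self.flush_if_due()

    def flush_if_due(self) -> None:
        """Flush when the batch hit max_items or its oldest item is too old."""
        if not self._items:
            return
        too_many = len(self._items) >= self._max_items
        too_old = self._clock() - self._first_item_at >= self._max_age
        if too_many or too_old:
            self.flush()

    def flush(self) -> None:
        """Hand the current batch to the flush callback and start a new one."""
        if self._items:
            batch, self._items = self._items, []
            self._first_item_at = None
            self._flush(batch)


# Possible wiring (assumes the kafka-python and elasticsearch packages):
#   consumer = KafkaConsumer("demo.crawled_firehose", bootstrap_servers=...)
#   batcher = BulkBatcher(lambda docs: helpers.bulk(es, docs))
#   for message in consumer:
#       batcher.add(json.loads(message.value))
```

Because the age check only runs inside `add` or `flush_if_due`, a quiet topic needs a periodic `flush_if_due()` call (for example from a polling loop with a timeout) for the 10-minute limit to take effect.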


anthony9981 commented on July 26, 2024

Hi @NeoArio,
The idea of a database that stores the selectors is so nice, why didn't I think of it before 👍
Could you please show me yours?
I came from PHP to Python and then ended up here, so Kafka is new to me :)
Thanks for pointing me in the right direction :) Let me learn it in more depth.
Best regards,

