
Comments (5)

mrasoolmirzaei avatar mrasoolmirzaei commented on July 26, 2024 1

By default, scrapy_cluster won't crawl websites with a maxdepth larger than 3. You need to change the schema first. To do this, log in to your kafka_monitor container:

1- docker exec -it container_id bash
2- cd plugins
3- edit scraper_schema.json (change the max value for maxdepth from 3 to whatever you want)

From this point on, you can crawl websites with any maxdepth between the schema's min value and the new max value you just set.
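For reference, the maxdepth entry inside scraper_schema.json looks roughly like the fragment below; raising "maximum" is the change described in step 3 (the exact surrounding keys may differ between scrapy-cluster versions, so treat this as a sketch):

```json
"maxdepth": {
    "type": "integer",
    "minimum": 0,
    "default": 0,
    "maximum": 3
}
```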

from scrapy-cluster.

madisonb avatar madisonb commented on July 26, 2024 1

I'm happy to chat through custom implementations on Gitter, but per the guidelines I am going to close this issue as a "custom implementation" question which is beyond the scope of a true bug ticket/problem.

More generally - crawling at a depth beyond 2 takes your spider way into the weeds of the internet and, 99% of the time, is not useful for your actual request. If you wish to crawl at a greater depth, you should also add an allowed_domains filter or regex to the crawl API request to limit your crawler to a specific domain.
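As a sketch, a crawl request restricted to one domain might look like the JSON below. The url, appid, and crawlid values are placeholders; the field names follow the scraper schema linked below, but check your version of the schema for the exact spec:

```json
{
    "url": "http://example.com",
    "appid": "testapp",
    "crawlid": "abc123",
    "maxdepth": 2,
    "allowed_domains": ["example.com"]
}
```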

If you need to change anything else in the API spec for the request, you can do so in this file: https://github.com/istresearch/scrapy-cluster/blob/master/kafka-monitor/plugins/scraper_schema.json


anthony9981 avatar anthony9981 commented on July 26, 2024

Hi @NeoArio,
Thanks for your reply.
Your answer helps me a lot.
I have a question if you don't mind:

I have some known websites, and I need to crawl the title and content with an exact selector for each of them.
Q1: How can I predefine a CSS selector for each of them and then feed the monitor only the domain?
Q2: And where can I pick up the scraped items to store them in a database like Elasticsearch?
I tried pipelines (scrapy-elasticsearch), but it sends a ton of additional requests to the ES server.

Sorry, I'm new to scrapy. This is awesome!
Best regards,


mrasoolmirzaei avatar mrasoolmirzaei commented on July 26, 2024

Hi! I hope you enjoy scraping :D

Q1: I have another database that stores each website's CSS and XPath patterns. I'm not sure whether what you want to do is really feasible.
Q2: scraped items will be pushed to the demo.crawled_firehose topic: https://scrapy-cluster.readthedocs.io/en/latest/topics/kafka-monitor/api.html#kafka-topics
Write code to consume from this topic, then do what you want with that data. Finally, you can send it to Elasticsearch through another Kafka pipeline. I think it is better to insert into Elasticsearch with bulk requests, meaning each insertion contains, say, 100 crawled links. Add a timeout alongside this and you are set: 100 crawled links or 10 minutes are good conditions for flushing to Elasticsearch.
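The consume-and-bulk-insert flow above can be sketched in Python. This is a minimal illustration, not part of scrapy-cluster itself: the BulkBuffer class implements the "100 links or 10 minutes" flush rule, and run_consumer shows one way to wire it to the demo.crawled_firehose topic using the kafka-python and elasticsearch client libraries (the broker address, index name, and ES host are assumptions):

```python
import json
import time


class BulkBuffer:
    """Buffer items and flush in bulk when either a size or an age limit is hit."""

    def __init__(self, flush_fn, max_items=100, max_age_seconds=600):
        self.flush_fn = flush_fn          # called with the list of buffered items
        self.max_items = max_items
        self.max_age = max_age_seconds
        self.items = []
        self.last_flush = time.monotonic()

    def add(self, item):
        self.items.append(item)
        if self.should_flush():
            self.flush()

    def should_flush(self):
        too_many = len(self.items) >= self.max_items
        too_old = time.monotonic() - self.last_flush >= self.max_age
        return bool(self.items) and (too_many or too_old)

    def flush(self):
        if self.items:
            self.flush_fn(self.items)
            self.items = []
        self.last_flush = time.monotonic()


def run_consumer():
    # Illustrative wiring only; requires the kafka-python and elasticsearch
    # packages, a running broker, and an ES cluster. Names are placeholders.
    from kafka import KafkaConsumer
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch(["http://localhost:9200"])

    def bulk_insert(items):
        # One bulk request per batch instead of one request per crawled link.
        actions = [{"_index": "crawled", "_source": doc} for doc in items]
        helpers.bulk(es, actions)

    buffer = BulkBuffer(bulk_insert, max_items=100, max_age_seconds=600)
    consumer = KafkaConsumer(
        "demo.crawled_firehose",
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    )
    for message in consumer:
        buffer.add(message.value)
```

The buffering logic is kept separate from the Kafka and ES plumbing so it can be tested on its own; a real deployment would also flush the buffer on shutdown so the tail of a batch is not lost.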


anthony9981 avatar anthony9981 commented on July 26, 2024

Hi @NeoArio ,
The idea of a database that stores the selectors is really nice; why didn't I think of it before 👍
Could you please show me yours?
I came from PHP to Python and ended up here, so Kafka is new to me :)
Thanks for pointing me in the right direction :) Let me learn it more deeply.
Best regards,

