I think Postgres would be a good way to store the spider state in case the system c

Discussion: Store state for fault-tolerance crawler (crawly issue, closed)

elixir-crawly commented on August 28, 2024

Discussion: Store state for fault-tolerance crawler

from crawly.

Comments (5)

Ziinc commented on August 28, 2024

You can use a custom pipeline to update your database tables with your scraped data.

When your spiders start, their starting URLs can then be queried from the database.

You can use an Ecto-based job queue like Honeydew to poll your database for things to be scraped and start up the relevant spiders accordingly. Alternatively, you can use Mnesia to persist state.
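
As a rough illustration of the approach described above, here is a minimal, language-neutral sketch in Python. It is not Crawly's API: the table layout and the function names (`open_state`, `enqueue`, `store_item`, `start_urls`) are hypothetical, and SQLite stands in for Postgres so the example is self-contained. The pipeline step saves each scraped item and marks its URL as done, so a restarted spider only picks up the URLs that were never finished.

```python
import sqlite3


def open_state(path=":memory:"):
    """Open (or create) the crawl state database. Hypothetical schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS urls (
               url  TEXT PRIMARY KEY,
               done INTEGER NOT NULL DEFAULT 0)"""
    )
    conn.execute(
        """CREATE TABLE IF NOT EXISTS items (
               url   TEXT PRIMARY KEY,
               title TEXT)"""
    )
    conn.commit()
    return conn


def enqueue(conn, urls):
    """Record URLs to crawl; URLs already known are left untouched."""
    conn.executemany(
        "INSERT OR IGNORE INTO urls (url) VALUES (?)", [(u,) for u in urls]
    )
    conn.commit()


def store_item(conn, item):
    """Pipeline step: save the item and mark its URL done in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO items (url, title) VALUES (?, ?)",
            (item["url"], item["title"]),
        )
        conn.execute("UPDATE urls SET done = 1 WHERE url = ?", (item["url"],))
    return item


def start_urls(conn):
    """URLs a (re)started spider should begin with: everything not yet done."""
    rows = conn.execute(
        "SELECT url FROM urls WHERE done = 0 ORDER BY url"
    ).fetchall()
    return [row[0] for row in rows]
```

Writing the item and the `done` flag in the same transaction means a crash between the two cannot leave a URL marked as finished without its item being stored.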

commented on August 28, 2024

@Ziinc Thanks again. Soon will do.

lucaong commented on August 28, 2024

Pitching in to mention CubQ as a way to implement durable queues in an embedded database. Full disclaimer: I am the author of the library. I created it mostly for embedded software scenarios, but I think it would fit well for keeping a crawler queue too.

Ziinc commented on August 28, 2024

For improved fault tolerance, usage of persistent queues for requests/ScrapedItems would definitely be good.

However, this would involve an additional dependency, and it would be hard to argue for CubQ (backed by CubDB) over other more established queue libraries backed by Mnesia.
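
To make the idea of a persistent request queue concrete, here is a minimal sketch in Python, assuming an on-disk SQLite file as the backing store. It is not how CubQ, Mnesia, or Crawly work internally: the `DurableQueue` class and its method names are hypothetical. Popped requests stay on disk until they are acknowledged, so a request that was in flight when the process died is handed out again after a restart (at-least-once delivery).

```python
import sqlite3


class DurableQueue:
    """Hypothetical on-disk FIFO queue with explicit acknowledgement."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            """CREATE TABLE IF NOT EXISTS queue (
                   id        INTEGER PRIMARY KEY AUTOINCREMENT,
                   payload   TEXT NOT NULL,
                   in_flight INTEGER NOT NULL DEFAULT 0)"""
        )
        # Anything still in flight from a previous run was never acked,
        # so make it visible again.
        with self.conn:
            self.conn.execute("UPDATE queue SET in_flight = 0")

    def push(self, payload):
        """Append a request (for example a URL) to the queue."""
        with self.conn:
            self.conn.execute("INSERT INTO queue (payload) VALUES (?)", (payload,))

    def pop(self):
        """Return (id, payload) of the oldest pending request, or None."""
        with self.conn:
            row = self.conn.execute(
                "SELECT id, payload FROM queue "
                "WHERE in_flight = 0 ORDER BY id LIMIT 1"
            ).fetchone()
            if row is None:
                return None
            self.conn.execute(
                "UPDATE queue SET in_flight = 1 WHERE id = ?", (row[0],)
            )
        return row

    def ack(self, item_id):
        """Remove a request once it has been fully processed."""
        with self.conn:
            self.conn.execute("DELETE FROM queue WHERE id = ?", (item_id,))

    def close(self):
        self.conn.close()
```

A worker would call `pop()`, fetch and process the page, and only then call `ack(id)`. The trade-off is that a request may be processed twice after a crash, so the item pipeline should tolerate duplicates.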

Ziinc commented on August 28, 2024

I don't think that there is a strong case for improving fault tolerance and stability now when there are other features to be implemented.

Perhaps in the future, we can reopen this when there is a v1.
