ziinc / crawldis-old
This project forked from elixir-crawly/crawly
License: Apache License 2.0
JS rendering is not supported; opt for Puppeteer/Playwright for JS sites.
Middlewares should be piped per request just before executing the fetch. At this stage, we can drop the request as needed.
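Roughly, the per-request pipe could look like this (module and callback names are made up for illustration; each middleware returns {request, state} to continue, or false to drop):

```elixir
# Hypothetical per-request middleware pipeline, run just before the fetch.
defmodule Crawldis.RequestPipeline do
  def pipe_middlewares(request, middlewares, state \\ %{}) do
    Enum.reduce_while(middlewares, {request, state}, fn middleware, {req, st} ->
      case middleware.run(req, st) do
        false -> {:halt, :dropped}
        {req, st} -> {:cont, {req, st}}
      end
    end)
  end
end
```

If any middleware returns false, the request is dropped before the fetcher ever runs.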
ParsedItem is a little ambiguous. Need to reduce usage of the term "item" as well, as it is very confusing.
Once a crawl job has started (#4), initial requests from Web will be passed to the Requestor pool leader.
A few ways to handle crawling:
delta crdt for fast distributed cache syncing looks promising
https://hexdocs.pm/delta_crdt/DeltaCrdt.html#start_link/2
Decided to experiment with DeltaCRDT for data syncing across nodes. We can also implement a separate storage mechanism for the CRDT, which opens the door to using ETS for storing data in memory.
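A minimal sketch of the syncing behaviour, assuming a recent delta_crdt version and a plain map standing in for the real %Request{} struct:

```elixir
# Two replicas syncing a shared request cache. In practice each node runs one
# replica and neighbours are wired up across the cluster.
{:ok, crdt_a} = DeltaCrdt.start_link(DeltaCrdt.AWLWWMap, sync_interval: 50)
{:ok, crdt_b} = DeltaCrdt.start_link(DeltaCrdt.AWLWWMap, sync_interval: 50)

# Neighbours must be set in both directions for bidirectional syncing.
DeltaCrdt.set_neighbours(crdt_a, [crdt_b])
DeltaCrdt.set_neighbours(crdt_b, [crdt_a])

DeltaCrdt.put(crdt_a, "https://example.com", {:unclaimed, %{url: "https://example.com"}})
Process.sleep(100)
DeltaCrdt.to_map(crdt_b)
# => %{"https://example.com" => {:unclaimed, %{url: "https://example.com"}}}
```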
See Crawly.Worker.get_response/1: Crawly.Worker performs full-blown request-response processing, which is not what we want for the Requestor's implementation; the Requestor should only fetch the response and parse it. The Requestor also needs to be able to store the crawl's config.
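A rough sketch of what a Requestor could look like (module, field, and parser names are assumptions): hold the crawl config, fetch, parse, and nothing else.

```elixir
defmodule Crawldis.Requestor do
  use GenServer

  def start_link(config), do: GenServer.start_link(__MODULE__, config)

  @impl true
  def init(config), do: {:ok, %{config: config}}

  # Fetch-and-parse only; no full pipelines or item processing like Crawly.Worker.
  @impl true
  def handle_call({:fetch, url}, _from, %{config: config} = state) do
    {:ok, %HTTPoison.Response{body: body}} = HTTPoison.get(url)
    parsed = config.parser.parse(body, config)
    {:reply, parsed, state}
  end
end
```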
Request queuing requirements: each request needs to be queued in a distributed fashion while ensuring that Requestors do not do duplicate work. As such, a good way to minimize overlap is to allow Requestors to "claim" requests before actually doing any work on them. Requestors can only claim unclaimed requests.
So the lifecycle of a request in the queue is:
unclaimed -> claimed -> popped
When popped from the queue, it no longer appears in the queue.
Internal state held in the CRDT is "https://www...." => {:unclaimed, %Request{...}}.
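Sketched against the CRDT map above (function names are assumptions; CRDT writes are not transactional, so concurrent claims would still need conflict handling):

```elixir
defmodule Crawldis.RequestQueue do
  # unclaimed -> claimed: only unclaimed requests can be claimed.
  def claim(crdt, url) do
    case DeltaCrdt.to_map(crdt) do
      %{^url => {:unclaimed, request}} ->
        DeltaCrdt.put(crdt, url, {:claimed, node(), request})
        {:ok, request}

      _ ->
        # Already claimed by another Requestor, or not in the queue at all.
        :error
    end
  end

  # claimed -> popped: the request no longer appears in the queue.
  def pop(crdt, url), do: DeltaCrdt.delete(crdt, url)
end
```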
Middlewares, such as retrying logic, are optional. Internally, all Crawly middlewares should be optional. Hence we will omit them for now.
Centralized crawling management, connected to all nodes. Each node has a process that interfaces with the management node(s).
Each job starts a certain number of Requestors & Processors.
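A sketch of the per-job supervision, assuming hypothetical Crawldis.Requestor / Crawldis.Processor modules and counts coming from the job config:

```elixir
defmodule Crawldis.CrawlJob do
  use Supervisor

  def start_link(config), do: Supervisor.start_link(__MODULE__, config)

  @impl true
  def init(config) do
    requestors =
      for i <- 1..config.requestor_count,
          do: Supervisor.child_spec({Crawldis.Requestor, config}, id: {:requestor, i})

    processors =
      for i <- 1..config.processor_count,
          do: Supervisor.child_spec({Crawldis.Processor, config}, id: {:processor, i})

    Supervisor.init(requestors ++ processors, strategy: :one_for_one)
  end
end
```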
v1
v2
Can CRUD the spider.
Spiders specify a specific page parsing pattern, e.g. extract all URLs with this glob pattern, then extract all text with this XPath.
Spiders are made up of a group of selected Parser modules and a configuration for each. They perform the parsing of responses into new Requests and ParsedItems.
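Loosely following Crawly's convention, the output of running a spider's parsers over one response might look something like this (shape and keys are assumptions):

```elixir
# New Requests to enqueue plus ParsedItems to hand on for processing.
%{
  requests: [%{url: "https://example.com/products?page=2"}],
  parsed_items: [%{title: "Blue Widget", url: "https://example.com/products/1"}]
}
```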
For web, job management is linked to actual crawl jobs created in the cluster.
Consider using Oban to manage persistent crawls. Crawl jobs on the cluster are not persistent, since the cluster has no persistence layer. On the other hand, web management should have a persistence layer to enable more functionality, such as storing history.
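If Oban is used on the web side, a persistent crawl could be modelled roughly like this (queue, module, and Crawldis.start_crawl/2 are assumptions):

```elixir
defmodule CrawldisWeb.Workers.StartCrawl do
  use Oban.Worker, queue: :crawls

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"spider" => spider, "urls" => urls}}) do
    # Hand off to the (non-persistent) cluster-side crawl management.
    Crawldis.start_crawl(spider, urls)
    :ok
  end
end

# Enqueue from the web UI; Oban persists the job and its history.
%{spider: "product_search", urls: ["https://example.com"]}
|> CrawldisWeb.Workers.StartCrawl.new()
|> Oban.insert()
```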
v1:
v2:
If piping through the fetcher and the fetcher returns {false, state}, then the request should be dropped.
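A small sketch of that drop handling at the fetch stage (the fetch/2 contract here is an assumption):

```elixir
defmodule Crawldis.FetchStep do
  # {false, state} from the fetcher drops the request; anything else continues.
  def run(fetcher, request, state) do
    case fetcher.fetch(request, state) do
      {false, _state} -> :dropped
      {response, state} -> {:ok, response, state}
    end
  end
end
```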
Config - Parsing
One way of viewing parsing config is through modules, each extracting text. This text can then either be converted into new requests or passed on as parsed items. Right now, I can think of a few:
xpath extraction
css selector extraction
regex extraction (for starters)
json extraction
glob extraction
For example, extract a list of text values and convert all of them into a list of requests or parsed items.
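A sketch of what such a config could look like, one entry per extraction module (module names and option keys are made up for illustration):

```elixir
# Each entry pairs an extraction module with its config; extracted text is either
# turned into new requests or emitted as parsed items.
parsing_config = [
  {Crawldis.Extractors.Xpath, %{selector: "//a/@href", emit: :request}},
  {Crawldis.Extractors.Css, %{selector: "h1.title", emit: :item, field: :title}},
  {Crawldis.Extractors.Regex, %{pattern: ~r/\$\d+\.\d{2}/, emit: :item, field: :price}},
  {Crawldis.Extractors.Json, %{path: ["data", "products"], emit: :item}},
  {Crawldis.Extractors.Glob, %{pattern: "https://example.com/products/*", emit: :request}}
]
```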
However, what if we want to extract a list of items (objects)? An example is a list of products (search results).
One way to model it is to use nested extraction rules.
For example, use a CSS selector to select all <li> elements, then use CSS selectors to query for title, url, and description, resulting in a list of objects.
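A sketch of nested rules for that search-results example (structure, keys, and the ::text/::attr notation are assumptions):

```elixir
# The outer selector picks each <li> result; the inner rules run within each
# selected element, producing one object (map) per <li>.
list_rule = %{
  selector: {:css, "ul.results > li"},
  each: %{
    title: {:css, "h2.title::text"},
    url: {:css, "a::attr(href)"},
    description: {:css, "p.description::text"}
  }
}
```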
It should also be possible to combine multiple selectors together and merge their results into the list of items. For example, what if the search results are split into two sections and require two different selectors? Or what if each selector returns empty results on certain page states? Supporting this allows for more parsing flexibility.
And what if we want to select different types of items that are present on each page? Then we would need multiple different sets of extraction rules, one for each type, and tag each parsed item with the corresponding type.
Each rule set would then specify a method (how to extract) and an item_type (the tag applied to its parsed items).
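For the multiple-item-types case, each rule set could carry its own item_type tag, with results from all rule sets merged into one list (again, shape is hypothetical):

```elixir
# Two rule sets over the same page, each tagging its parsed items with an item_type.
rule_sets = [
  %{item_type: :product, selector: {:css, "li.product"}, each: %{title: {:css, "h2::text"}}},
  %{item_type: :review, selector: {:css, "div.review"}, each: %{body: {:css, "p::text"}}}
]
# Results from all rule sets (including alternative selectors for the same type)
# are merged into a single list of tagged items.
```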