ziinc / crawldis-old
This project forked from elixir-crawly/crawly
License: Apache License 2.0
JS rendering is not supported; opt for Puppeteer/Playwright for JS sites.
Middlewares should be piped per request just before executing the fetch. At this stage, we can drop the request as needed.
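Roughly, the per-request pipe could look like this (module and callback names are made up for illustration; each middleware returns {request, state} to continue, or false to drop):

```elixir
# Hypothetical per-request middleware pipeline, run just before the fetch.
defmodule Crawldis.RequestPipeline do
  def pipe_middlewares(request, middlewares, state \\ %{}) do
    Enum.reduce_while(middlewares, {request, state}, fn middleware, {req, st} ->
      case middleware.run(req, st) do
        false -> {:halt, :dropped}
        {req, st} -> {:cont, {req, st}}
      end
    end)
  end
end
```

If any middleware returns false, the request is dropped before the fetcher ever runs.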
ParsedItem is a little ambiguous. Need to reduce usage of the term "item" as well, as it is very confusing.
Once a crawl job has started (#4), initial requests from Web will be passed to the Requestor pool leader.
A few ways to handle crawling:
delta crdt for fast distributed cache syncing looks promising
https://hexdocs.pm/delta_crdt/DeltaCrdt.html#start_link/2
Decided to experiment with DeltaCRDT for data syncing across nodes. We can also implement a separate storage mechanism for the CRDT, which opens the door to using ETS for storing data in memory.
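A minimal sketch of the syncing behaviour, assuming a recent delta_crdt version and a plain map standing in for the real %Request{} struct:

```elixir
# Two replicas syncing a shared request cache. In practice each node runs one
# replica and neighbours are wired up across the cluster.
{:ok, crdt_a} = DeltaCrdt.start_link(DeltaCrdt.AWLWWMap, sync_interval: 50)
{:ok, crdt_b} = DeltaCrdt.start_link(DeltaCrdt.AWLWWMap, sync_interval: 50)

# Neighbours must be set in both directions for bidirectional syncing.
DeltaCrdt.set_neighbours(crdt_a, [crdt_b])
DeltaCrdt.set_neighbours(crdt_b, [crdt_a])

DeltaCrdt.put(crdt_a, "https://example.com", {:unclaimed, %{url: "https://example.com"}})
Process.sleep(100)
DeltaCrdt.to_map(crdt_b)
# => %{"https://example.com" => {:unclaimed, %{url: "https://example.com"}}}
```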
See Crawly.Worker.get_response/1: Crawly.Worker performs full-blown request-response processing, which is not what we want for the Requestor's implementation; the Requestor should only fetch the response and parse it. The Requestor also needs to be able to store the crawl's config.
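A rough sketch of what a Requestor could look like (module, field, and parser names are assumptions): hold the crawl config, fetch, parse, and nothing else.

```elixir
defmodule Crawldis.Requestor do
  use GenServer

  def start_link(config), do: GenServer.start_link(__MODULE__, config)

  @impl true
  def init(config), do: {:ok, %{config: config}}

  # Fetch-and-parse only; no full pipelines or item processing like Crawly.Worker.
  @impl true
  def handle_call({:fetch, url}, _from, %{config: config} = state) do
    {:ok, %HTTPoison.Response{body: body}} = HTTPoison.get(url)
    parsed = config.parser.parse(body, config)
    {:reply, parsed, state}
  end
end
```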
Request queuing requirements: each request needs to be queued in a distributed fashion while ensuring that Requestors do not do duplicate work. As such, a good way to minimize overlap is to allow Requestors to "claim" requests before actually doing any work on them. Requestors can only claim unclaimed requests.
So the lifecycle of a request in the queue is:
unclaimed -> claimed -> popped
When popped from the queue, it no longer appears in the queue.
Internal state held in the CRDT is "https://www...." => {:unclaimed, %Request{...}}.
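Sketched against the CRDT map above (function names are assumptions; CRDT writes are not transactional, so concurrent claims would still need conflict handling):

```elixir
defmodule Crawldis.RequestQueue do
  # unclaimed -> claimed: only unclaimed requests can be claimed.
  def claim(crdt, url) do
    case DeltaCrdt.to_map(crdt) do
      %{^url => {:unclaimed, request}} ->
        DeltaCrdt.put(crdt, url, {:claimed, node(), request})
        {:ok, request}

      _ ->
        # Already claimed by another Requestor, or not in the queue at all.
        :error
    end
  end

  # claimed -> popped: the request no longer appears in the queue.
  def pop(crdt, url), do: DeltaCrdt.delete(crdt, url)
end
```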
Middlewares, such as retrying logic, are optional. Internally, all Crawly middlewares should be optional. Hence we will omit them for now.
Centralized crawling management, connected to all nodes. Each node has a process that interfaces with the management node(s).
Each job starts a certain number of Requestors & Processors.
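A sketch of the per-job supervision, assuming hypothetical Crawldis.Requestor / Crawldis.Processor modules and counts coming from the job config:

```elixir
defmodule Crawldis.CrawlJob do
  use Supervisor

  def start_link(config), do: Supervisor.start_link(__MODULE__, config)

  @impl true
  def init(config) do
    requestors =
      for i <- 1..config.requestor_count,
          do: Supervisor.child_spec({Crawldis.Requestor, config}, id: {:requestor, i})

    processors =
      for i <- 1..config.processor_count,
          do: Supervisor.child_spec({Crawldis.Processor, config}, id: {:processor, i})

    Supervisor.init(requestors ++ processors, strategy: :one_for_one)
  end
end
```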
v1
v2
Can CRUD the spider.
Spiders specify a specific page parsing pattern, e.g. extract all URLs with this glob pattern, then extract all text with this XPath.
Spiders are made up of a group of selected Parser modules and a configuration for each. They perform the parsing of responses into new Requests and ParsedItems.
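Loosely following Crawly's convention, the output of running a spider's parsers over one response might look something like this (shape and keys are assumptions):

```elixir
# New Requests to enqueue plus ParsedItems to hand on for processing.
%{
  requests: [%{url: "https://example.com/products?page=2"}],
  parsed_items: [%{title: "Blue Widget", url: "https://example.com/products/1"}]
}
```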
For web, job management is linked to actual crawl jobs created in the cluster.
Consider using Oban to manage persistent crawls. Crawl jobs on the cluster are not persistent, since the cluster has no persistence layer. On the other hand, web management should have a persistence layer to enable more functionality, such as storing history.
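If Oban is used on the web side, a persistent crawl could be modelled roughly like this (queue, module, and Crawldis.start_crawl/2 are assumptions):

```elixir
defmodule CrawldisWeb.Workers.StartCrawl do
  use Oban.Worker, queue: :crawls

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"spider" => spider, "urls" => urls}}) do
    # Hand off to the (non-persistent) cluster-side crawl management.
    Crawldis.start_crawl(spider, urls)
    :ok
  end
end

# Enqueue from the web UI; Oban persists the job and its history.
%{spider: "product_search", urls: ["https://example.com"]}
|> CrawldisWeb.Workers.StartCrawl.new()
|> Oban.insert()
```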
v1:
v2:
If piping through the fetcher and the fetcher returns {false, state}, then the request should be dropped.
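A small sketch of that drop handling at the fetch stage (the fetch/2 contract here is an assumption):

```elixir
defmodule Crawldis.FetchStep do
  # {false, state} from the fetcher drops the request; anything else continues.
  def run(fetcher, request, state) do
    case fetcher.fetch(request, state) do
      {false, _state} -> :dropped
      {response, state} -> {:ok, response, state}
    end
  end
end
```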
Config - Parsing
One way of viewing parsing config is through modules, each extracting text. This text can then either be converted into new requests or passed on as parsed items. Right now, I can think of a few:
xpath extraction
css selector extraction
regex extraction (for starters)
json extraction
glob extraction
For example, extract a list of text values and convert all of them into a list of requests or parsed items.
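A sketch of what such a config could look like, one entry per extraction module (module names and option keys are made up for illustration):

```elixir
# Each entry pairs an extraction module with its config; extracted text is either
# turned into new requests or emitted as parsed items.
parsing_config = [
  {Crawldis.Extractors.Xpath, %{selector: "//a/@href", emit: :request}},
  {Crawldis.Extractors.Css, %{selector: "h1.title", emit: :item, field: :title}},
  {Crawldis.Extractors.Regex, %{pattern: ~r/\$\d+\.\d{2}/, emit: :item, field: :price}},
  {Crawldis.Extractors.Json, %{path: ["data", "products"], emit: :item}},
  {Crawldis.Extractors.Glob, %{pattern: "https://example.com/products/*", emit: :request}}
]
```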
However, what if we want to extract a list of items (objects)? An example is a list of products (search results).
One way to model it is to use nested extraction rules.
For example, use a CSS selector to select all <li> elements, then use CSS selectors to query for title, url, and description, resulting in a list of objects.
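A sketch of nested rules for that search-results example (structure, keys, and the ::text/::attr notation are assumptions):

```elixir
# The outer selector picks each <li> result; the inner rules run within each
# selected element, producing one object (map) per <li>.
list_rule = %{
  selector: {:css, "ul.results > li"},
  each: %{
    title: {:css, "h2.title::text"},
    url: {:css, "a::attr(href)"},
    description: {:css, "p.description::text"}
  }
}
```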
It should also be possible to combine multiple selectors together and merge their results into the list of items. For example, what if the search results are split into two sections and require two different selectors? Or what if each selector returns empty results on certain page states? Supporting this allows for more parsing flexibility.
And what if we want to select different types of items that are present on each page? Then we would need multiple different sets of extraction rules, one for each type, and tag each parsed item with the corresponding type.
Each rule set would then specify a method (how to extract) and an item_type (the tag applied to its parsed items).
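For the multiple-item-types case, each rule set could carry its own item_type tag, with results from all rule sets merged into one list (again, shape is hypothetical):

```elixir
# Two rule sets over the same page, each tagging its parsed items with an item_type.
rule_sets = [
  %{item_type: :product, selector: {:css, "li.product"}, each: %{title: {:css, "h2::text"}}},
  %{item_type: :review, selector: {:css, "div.review"}, each: %{body: {:css, "p::text"}}}
]
# Results from all rule sets (including alternative selectors for the same type)
# are merged into a single list of tagged items.
```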