
crawldis-old's Issues

can see details for a job

  • able to see crawl duration, start timestamp, stop timestamp
  • able to see parsing config
  • able to see start urls

app/requestor: Can receive a Request from Web and begin crawling

Once a crawl job has started (#4), initial requests from Web will be passed to the Requestor pool leader.

A few ways to handle crawling:

  • Request delegating: the Requestor leader handles request delegation, enqueuing requests to specific Requestors based on their queue sizes.
    • drawback: if a node goes down, all requests enqueued for that node are lost
  • Central cache: all requests are stored in a single central cache, and Requestors pop new requests off it to crawl.
    • drawback: if the cache node goes down, all requests for the cluster are lost
  • Distributed cache: using a CRDT, every Requestor keeps a copy of the cache, providing data redundancy and fault tolerance.
    • drawback: requests may be crawled multiple times, because replication is not atomic
  • Distributed cache with consensus crawling: use a CRDT to cache the requests, but use a consensus algorithm to determine the crawl sequence.
    • drawback: reaching consensus would be slow and would not scale to fast crawling requirements

Delta CRDTs look promising for fast distributed cache syncing (basic usage is sketched below the link):

https://hexdocs.pm/delta_crdt/DeltaCrdt.html#start_link/2
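
A minimal sketch of two replicas syncing through DeltaCrdt, roughly following the library's documented usage; the `sync_interval` value and the sleep duration are illustrative, and the exact function set should be checked against the installed version:

```elixir
# two in-memory replicas of an add-wins last-write-wins map
{:ok, crdt_a} = DeltaCrdt.start_link(DeltaCrdt.AWLWWMap, sync_interval: 50)
{:ok, crdt_b} = DeltaCrdt.start_link(DeltaCrdt.AWLWWMap, sync_interval: 50)

# neighbours must be set in both directions for two-way delta syncing
DeltaCrdt.set_neighbours(crdt_a, [crdt_b])
DeltaCrdt.set_neighbours(crdt_b, [crdt_a])

# a write on one replica becomes visible on the other after a sync interval
DeltaCrdt.put(crdt_a, "https://example.com", :unclaimed)
Process.sleep(100)
DeltaCrdt.to_map(crdt_b)
# => %{"https://example.com" => :unclaimed}
```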

22 Apr Updates

Decided to experiment with DeltaCRDT for data syncing across nodes. We can also plug in a separate storage mechanism for the CRDT, which opens the door to using ETS for storing data in memory.

  • able to store a request in the request storage worker (currently uses Crawly's GenServer state storage mechanism)
  • able to fetch the response (using Crawly.Worker.get_response/1). However, Crawly.Worker performs full-blown request-response processing, which is more than we want for the Requestor's implementation; it should only fetch the response and parse it (see the fetch sketch after this list).
  • #7
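
A minimal fetch-only sketch, assuming HTTPoison (Crawly's default HTTP client) is available; the `Requestor.Fetcher` module name is hypothetical:

```elixir
defmodule Requestor.Fetcher do
  @moduledoc "Fetches a response without running Crawly's full request-response pipeline."

  # Returns the raw body for a 200 response, or an error tuple otherwise.
  def fetch(url, headers \\ [], options \\ []) do
    case HTTPoison.get(url, headers, options) do
      {:ok, %HTTPoison.Response{status_code: 200, body: body}} -> {:ok, body}
      {:ok, %HTTPoison.Response{status_code: code}} -> {:error, {:http_status, code}}
      {:error, %HTTPoison.Error{reason: reason}} -> {:error, reason}
    end
  end
end
```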

Requestor also needs to be able to store the crawl's config.

  • using the same Crawly mental model, we can think of a crawl's config as comprising a spider's start urls, the parsing logic (request and parsed item extraction), and the parsed item processing logic.
  • Technically, if we don't want to process the parsed item, then an empty parsed-item-processing config would skip over the Processor.
  • This means that only the start urls and the parsing config are strictly required (a sketch of such a config follows this list).
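
A minimal sketch of such a config under assumed names (`Requestor.CrawlConfig` and its fields are not an existing Crawly API):

```elixir
defmodule Requestor.CrawlConfig do
  defstruct start_urls: [],  # where the crawl begins (required)
            parsers: [],     # request and parsed-item extraction logic (required)
            processors: []   # parsed-item processing; [] skips the Processor entirely

  # Builds a config; only start_urls and parsers are required.
  def new(start_urls, parsers, processors \\ []) do
    %__MODULE__{start_urls: start_urls, parsers: parsers, processors: processors}
  end
end
```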

24 Apr Update

Request queuing requirements: each request needs to be queued in a distributed fashion while ensuring that Requestors do not do duplicate work. As such, a good way to minimize overlap is to allow Requestors to "claim" requests before actually doing any work on them. Requestors can only claim unclaimed requests.

So the lifecycle of a request in the queue is:

unclaimed -> claimed -> popped

When popped from the queue, it no longer appears in the queue.

Internal state held in the CRDT is "https://www...." => {:unclaimed, %Request{...}}. A minimal sketch of these queue operations follows the checklist below.

  • can queue a request
  • can claim a request
  • can pop a request
  • can replicate the queue across nodes
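
A minimal sketch of the claim/pop lifecycle on top of a DeltaCrdt map started as in the earlier snippet; the module, function names, and the claimed-tuple shape are assumptions, and the DeltaCrdt calls (put/3, to_map/1, delete/2) should be verified against the installed version:

```elixir
defmodule Requestor.RequestQueue do
  # unclaimed: enqueue a request keyed by its url
  def queue(crdt, url, request), do: DeltaCrdt.put(crdt, url, {:unclaimed, request})

  # claimed: find the first unclaimed request and mark it as claimed.
  # Note: claiming is not atomic across replicas, so occasional duplicate
  # claims are possible (the drawback already noted above).
  def claim(crdt) do
    crdt
    |> DeltaCrdt.to_map()
    |> Enum.find(fn {_url, {status, _req}} -> status == :unclaimed end)
    |> case do
      nil ->
        :none

      {url, {:unclaimed, request}} ->
        DeltaCrdt.put(crdt, url, {:claimed, request})
        {:ok, url, request}
    end
  end

  # popped: remove the request so it no longer appears in the queue
  def pop(crdt, url), do: DeltaCrdt.delete(crdt, url)
end
```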

Middlewares, such as retry logic, are optional. Internally, all Crawly middlewares should be optional. Hence we will omit them for now.

jobber: crawl jobs should hold config state across nodes

Centralized crawling management, connected to all nodes. Each node has a process that interfaces with the management node(s).

Each job starts a certain number of Requestors & Processors (a supervision sketch follows the v2 list below).

v1

  • Run on same node with cluster

v2

  • should broadcast stats to management node
  • scaling requestors/processors with monitoring (linear increase/ buffer/ )
  • warning alerts
  • web api
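
A sketch of a per-job supervisor that starts a configurable number of Requestors and Processors; `Jobber.CrawlJob`, `Requestor`, and `Processor` are assumed module names, with the latter two taken to be ordinary GenServers that accept a job id:

```elixir
defmodule Jobber.CrawlJob do
  use Supervisor

  def start_link(opts), do: Supervisor.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    job_id = Keyword.fetch!(opts, :job_id)
    requestors = Keyword.get(opts, :requestors, 2)
    processors = Keyword.get(opts, :processors, 2)

    # one child spec per Requestor/Processor, all tied to the same job id
    children =
      for i <- 1..requestors do
        Supervisor.child_spec({Requestor, job_id: job_id}, id: {:requestor, i})
      end ++
        for i <- 1..processors do
          Supervisor.child_spec({Processor, job_id: job_id}, id: {:processor, i})
        end

    Supervisor.init(children, strategy: :one_for_one)
  end
end
```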

app/web: Can create a spider

Can CRUD the spider.

What is a spider?

Spiders specify a specific page parsing pattern, e.g. extract all urls matching a glob pattern, then extract all text matching an xpath.

Spiders are made up of a group of selected Parser modules and a configuration for each. They will perform the parsing of new Requests and ParsedItems (a sketch is below).
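
A minimal sketch of a spider as a list of Parser modules paired with per-parser configuration; the `Web.Spider` struct and the `Parser.*` module names are assumptions for illustration:

```elixir
defmodule Web.Spider do
  defstruct name: nil,
            start_urls: [],
            # each entry pairs a Parser module with its configuration
            parsers: []
end

# example definition
%Web.Spider{
  name: "products",
  start_urls: ["https://example.com/search"],
  parsers: [
    {Parser.Glob, pattern: "https://example.com/products/*"},
    {Parser.Xpath, query: "//h1/text()"}
  ]
}
```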

web: can initialize immediate crawl jobs

  • A crawl job is the scope of parsing work performed by a spider. All requests and parsed items will be linked to a crawl job.
  • Each job is given a job id.

For web, job management is linked to the actual crawl jobs created in the cluster.

Consider using Oban to manage persistent crawls. Crawl jobs on the cluster are not persistent, since the cluster has no persistence layer. Web management, on the other hand, should have a persistence layer to enable more functionality, such as storing history (a sketch of an Oban worker is below).
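
A hedged sketch of starting a cluster crawl from a persistent Oban job; Oban's Worker behaviour is the real library API, while `Web.Workers.StartCrawl` and `Cluster.start_crawl/1` are hypothetical placeholders for the web-to-cluster handoff:

```elixir
defmodule Web.Workers.StartCrawl do
  use Oban.Worker, queue: :crawls, max_attempts: 3

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"job_id" => job_id, "spider" => spider}}) do
    # hand off to the cluster; the web side keeps the persistent job record
    Cluster.start_crawl(job_id: job_id, spider: spider)
    :ok
  end
end

# enqueueing from the web layer:
# %{job_id: Ecto.UUID.generate(), spider: "products"}
# |> Web.Workers.StartCrawl.new()
# |> Oban.insert()
```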

v1:

  • see all running jobs
  • start/stop a job

v2:

  • see stats for a job
  • historical introspection
  • restarting
  • arguments
  • long running jobs
  • timeouts
  • scheduling

app/requestor: parse a response using a variable configuration.

Config - Parsing: one way of viewing the parsing config is through modules, each extracting text. This text can then either be converted into new requests or passed on as parsed items. Right now, I can think of a few:

  • xpath extraction
  • css selector extraction
  • regex extraction (for starters)
  • json extraction
  • glob extraction

For example, extract a list of text values and convert each of them into a new request or a parsed item.

However, what if we want to extract a list of items (objects)? An example is a list of products (search results). One way to model it is to use nested extraction rules: for example, use a css selector to select all <li> elements, then use css selectors within each to query for the title, url, and description, resulting in a list of objects.

It should also be possible to combine multiple selectors and merge their results into the list of items. For example, what if the search results are split in two and require two different selectors? Or what if each selector returns empty results on certain page states? Merging allows for more parsing flexibility.

And what if we want to select different types of items that are present on each page? Then we would need multiple different sets of extraction rules, one for each type, and tag each parsed item with the corresponding type.

  • %Extractor{} that defines the extraction method
  • Item extraction - a list of fragment extractors, each with a nested list of attribute extractors; each attribute has an extractor and an attr key. Limit nesting to one level for now (list extractors -> attribute extractors). Tag each item with an item_type.
  • Request extraction - a list of extractors whose extracted text is converted into urls (a sketch of these structs follows this list).
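
A sketch of the structs described above; the struct and field names are assumptions for illustration, keeping only the shape from the notes (fragment extractors -> attribute extractors, one level deep, tagged with an item_type):

```elixir
defmodule Extractor do
  # method is one of :xpath, :css, :regex, :json, :glob; rule is the pattern itself
  defstruct method: :css, rule: nil
end

defmodule ItemExtraction do
  defstruct item_type: nil,
            # selects each fragment, e.g. every <li> in the search results
            fragment: nil,
            # attr key => extractor applied within the fragment (one level only)
            attributes: %{}
end

defmodule RequestExtraction do
  # each extractor yields text that is converted into new request urls
  defstruct extractors: []
end

# example: products on a search results page
%ItemExtraction{
  item_type: :product,
  fragment: %Extractor{method: :css, rule: "li.result"},
  attributes: %{
    title: %Extractor{method: :css, rule: "h2"},
    url: %Extractor{method: :css, rule: "a[href]"},
    description: %Extractor{method: :css, rule: "p.desc"}
  }
}
```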
