How to do paging with scan-parallel? (faraday issue, open)

ulsa commented on August 18, 2024
How to do paging with scan-parallel?

from faraday.

Comments (15)

ptaoussanis commented on August 18, 2024

Hi Ulrik, sorry for the delay responding to this.

Just to clarify: have you used :limit successfully with scan and are having trouble getting it to work with scan-parallel specifically, or you aren't sure how to use :limit in general and you happen to be using scan-parallel?

I haven't tested it (don't have any db creds with me atm) - but I can't think of a reason why :limit shouldn't work with scan-parallel. It's just a thin scan wrapper to help handle the segment args automatically:

(defn scan-parallel
  "Like `scan` but starts a number of worker threads and automatically handles
  parallel scan options (:total-segments and :segment). Returns a vector of
  `scan` results.

  Ref. http://goo.gl/KLwnn (official parallel scan documentation)."
  [creds table total-segments & [opts]]
  (let [opts (assoc opts :total-segments total-segments)]
    (->> (mapv (fn [seg] (future (scan creds table (assoc opts :segment seg))))
               (range total-segments))
         (mapv deref))))

As for using :limit + :last-prim-kvs: any time you see prim-kvs in a docstring/arg-name it means an argument of form {<hash-key> <val>} or {<hash-key> <val> <range-key> <val>} - i.e. the same form used by get-item, etc.

So to implement paging you'd want to do something like this [untested, don't have a db with me]:

(scan creds :my-table {:limit 2 :attr-conds {:age [:in [24 27]]}})
=> [{:age 24, :name "Steve"} {:age 27, :name "Susan"}]
(scan creds :my-table {:last-prim-kvs {:age 27 :name "Susan"} :attr-conds {:age [:in [24 27]]}})
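
A minimal sketch of the paging loop this suggests, assuming (it is not confirmed in this thread) that each scan result carries its last evaluated key under `:last-prim-kvs` in its metadata. `process-pages!`, `fetch-page` and `handle-page!` are hypothetical names; `fetch-page` stands in for something like `(partial scan creds :my-table)`.

```clojure
;; Sketch only. Assumption: (meta page) holds :last-prim-kvs while more
;; pages remain, and lacks it once the scan is exhausted.
(defn process-pages!
  "Calls (fetch-page opts) repeatedly, handing each page to handle-page!,
  until a page comes back without :last-prim-kvs in its metadata.
  Returns the number of pages processed."
  [fetch-page opts handle-page!]
  (loop [opts opts
         n    0]
    (let [page     (fetch-page opts)
          last-key (:last-prim-kvs (meta page))]
      (handle-page! page)
      (if last-key
        ;; Feed the key back so the next request starts after it.
        (recur (assoc opts :last-prim-kvs last-key) (inc n))
        (inc n)))))
```

Only one page is held at a time, so memory use is bounded by `:limit` rather than by the table size.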

Does that help?

ulsa commented on August 18, 2024

I am using :limit and paging successfully with scan, but I can't understand how to do it with scan-parallel. Or is it perhaps so that paging is not possible with scan-parallel, because the order is not predictable or something?

I have around a million entries that I want to process, and I don't want to read them all into memory at once. I'm currently using scan with :limit and paging, processing a batch at a time. However, I have trouble reaching the provisioned limits using scan, so I figured I could use scan-parallel. But perhaps it's not designed to support paging.

ptaoussanis commented on August 18, 2024

> I am using :limit and paging successfully with scan, but I can't understand how to do it with scan-parallel

Sorry I don't have any test dbs on hand atm - it'd help if you could be a little more specific. Are you seeing an error when you replace scan with scan-parallel as in the example I provided above?

> Or is it perhaps so that paging is not possible with scan-parallel

It should be possible. Unless I'm misunderstanding something about what you're trying to do - it should literally be as simple as replacing scan with scan-parallel in your call. No args need to change. Nothing about your methodology needs to change. It should work as a drop-in replacement. What happens when you do that?

ulsa commented on August 18, 2024

I didn't want to provide lots of details if I was completely misunderstanding the functionality of scan-parallel, but if you say that paging should work, then let's press on. I'll give you details soon. Meanwhile, consider this:

scan-parallel just passes the given opts on to each underlying scan, with the corresponding :segment number added on, right? If I want to send :last-prim-kvs as opts, like I did when I was doing paging with just scan, then how should I specify those? Each segment needs its own starting point, but as far as I can understand, I can only specify a single :last-prim-kvs. Which segment does that go to? The first? What about the other segments? It just doesn't make sense to me.
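
To make the question concrete, here is a small sketch that reproduces only the option handling from the `scan-parallel` source quoted earlier in this thread. `segment-opts` is a hypothetical helper written for illustration and is not part of Faraday; it performs no scan.

```clojure
(defn segment-opts
  "Returns the option map each segment's scan would receive, mirroring the
  assoc calls in the scan-parallel source quoted earlier in this thread."
  [total-segments opts]
  (let [opts (assoc opts :total-segments total-segments)]
    (mapv (fn [seg] (assoc opts :segment seg))
          (range total-segments))))

(segment-opts 2 {:limit 1 :last-prim-kvs {:id 5}})
;; => [{:limit 1, :last-prim-kvs {:id 5}, :total-segments 2, :segment 0}
;;     {:limit 1, :last-prim-kvs {:id 5}, :total-segments 2, :segment 1}]
```

Every segment receives the same `:last-prim-kvs`, so a key obtained from one segment is also sent to all the others.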

ulsa commented on August 18, 2024

I am using :limit and paging successfully with scan:

(scan creds :my-table {:limit 1})

This will give me a vector containing the first page of entries (the actual number depends, in my case 5):

[{:id 1, :x "a"} {:id 2, :x "b"} {:id 3, :x "c"} {:id 4, :x "d"} {:id 5, :x "e"}]

In the next call, I set :last-prim-kvs to {:id 5}, to indicate that I want the scan to start after that id:

(scan creds :my-table {:last-prim-kvs {:id 5} :limit 1})

This will give me the next page of entries:

[{:id 6, :x "f"} {:id 7, :x "g"} {:id 8, :x "h"} {:id 9, :x ""} {:id 10, :x "i"}]

I can't understand how to do it with scan-parallel. The first call is obvious, though. I'm requesting two segments:

(scan-parallel creds :my-table 2 {:limit 1})

This will give me a vector of size 2, where each element is a vector containing some page of entries, not necessarily page one and two:

[
 [{:id 1, :x "a"} {:id 2, :x "b"} {:id 3, :x "c"} {:id 4, :x "d"} {:id 5, :x "e"}]
 [{:id 16, :x "s"} {:id 17, :x "r"} {:id 18, :x "k"} {:id 19, :x "q"} {:id 20, :x "p"}]
]

What about the subsequent calls for the remaining pages? How do I specify :last-prim-kvs? If I do it like with scan, I get this error:

user=> (scan-parallel creds :my-table 2 {:last-prim-kvs {:id "5"} :limit 1})
AmazonServiceException The provided starting key is invalid: Invalid ExclusiveStartKey. 
Please use ExclusiveStartKey with correct Segment. TotalSegments: 2 Segment: 1  
com.amazonaws.http.AmazonHttpClient.handleErrorResponse (AmazonHttpClient.java:679)

You're saying that scan-parallel handles the segment args automatically, but the :last-prim-kvs will be different for each segment. If I could pass a vector of :last-prim-kvs maps, I can see how it could work out which map should go to which segment, but I don't seem to be able to pass a vector. Besides, the pages do not seem to be deterministic, so I fear that scan-parallel cannot be used with paging.
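
The vector idea could look roughly like the sketch below. Both functions are hypothetical and not part of Faraday. `scan-fn` stands in for something like `(partial scan creds :my-table)`, and the sketch assumes each result exposes its last evaluated key as `:last-prim-kvs` in its metadata, which this thread does not confirm.

```clojure
(defn scan-segments-once
  "Hypothetical helper. Fetches one page per segment in parallel.
  start-keys is a vector indexed by segment; a nil entry means
  'start this segment from the beginning'."
  [scan-fn opts start-keys]
  (let [total (count start-keys)]
    (->> (map-indexed
           (fn [seg start-key]
             (future
               (scan-fn (cond-> (assoc opts :total-segments total :segment seg)
                          start-key (assoc :last-prim-kvs start-key)))))
           start-keys)
         vec            ; realize eagerly so every future starts before any deref
         (mapv deref))))

(defn next-start-keys
  "Collects each page's :last-prim-kvs metadata into a vector indexed by segment."
  [pages]
  (mapv (comp :last-prim-kvs meta) pages))
```

Because a `nil` entry restarts a segment from its beginning, a caller would still need to track separately which segments have already finished.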

ptaoussanis commented on August 18, 2024

Hi, closing this - assuming it's gone stale?

ulsa commented on August 18, 2024

I couldn't get it to work, but I still don't know if I did something wrong or if there is something missing in faraday.

ptaoussanis commented on August 18, 2024

Yeah, sorry - I'm actually not using DynamoDB myself at the moment. Not sure offhand, and I don't have any test dbs handy to look into this quickly. I'd need to spend some proper time digging into the DDB docs + API to confirm: this may be a DDB limitation, or a Faraday limitation that needs fixing.

Will reopen in case I do find some time in future, or someone else has some input.

Really sorry to leave you hanging on this, wasn't intentional.

ptaoussanis commented on August 18, 2024

Quick Google yielded this: http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Scan.html

> In a parallel scan, a Scan request that includes ExclusiveStartKey must specify the same segment whose previous Scan returned the corresponding value of LastEvaluatedKey.

So it seems like parallel scans should be pageable, but Faraday's scan implementation would need some work to allow this to be automatic. Have made a TODO note in the code though realistically don't think I'll personally have time to look into this near-term.

You may be able to use scan directly and feed it the necessary parallel segment info; not sure how tricky that'd be to do.
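
A rough sketch of that approach: each segment runs its own paging loop and feeds its own key back only to itself, which is what the quoted DynamoDB rule requires. The function names are hypothetical, `scan-fn` stands in for something like `(partial scan creds :my-table)`, and the sketch assumes the last evaluated key is available as `:last-prim-kvs` in each result's metadata.

```clojure
(defn scan-segment!
  "Pages through one segment, passing each page to handle-page!. The
  segment's own :last-prim-kvs is fed back only to that segment.
  Returns the number of pages fetched."
  [scan-fn opts total-segments segment handle-page!]
  (loop [opts (assoc opts :total-segments total-segments :segment segment)
         n    0]
    (let [page     (scan-fn opts)
          last-key (:last-prim-kvs (meta page))]
      (handle-page! segment page)
      (if last-key
        (recur (assoc opts :last-prim-kvs last-key) (inc n))
        (inc n)))))

(defn scan-all-segments!
  "Runs scan-segment! for every segment on its own future and waits for all
  of them. handle-page! is called from several threads, so it must be
  thread-safe. Returns a vector of per-segment page counts."
  [scan-fn opts total-segments handle-page!]
  (->> (range total-segments)
       (mapv #(future (scan-segment! scan-fn opts total-segments % handle-page!)))
       (mapv deref)))
```

Segments progress independently here, so one slow segment does not hold the others back between pages.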

PRs super welcome if you (or anyone else) feels like taking a stab at this!

Cheers :-)

barkanido commented on August 18, 2024

Is this something worth fixing for a Faraday noob? Are you open to reviewing a PR for this? Any thoughts on a reasonable direction for a solution?

kipz commented on August 18, 2024

I think it would be nice to get this fixed @barkanido, given we have the beginning of an implementation, so I expect PRs would be welcomed by the community.

Having said that however, you could probably just manage the threads and paging yourself and just call scan directly, and personally, that's the approach I'd prefer here.

I have an implementation of a lazy-paged-query that will manage query paging automatically, and I'm glad that it was easy to build on top of Faraday, but don't feel that it should be part of the API. Handling thread-pools and paging of large result-sets of scans fits into the same category for me. Of course I don't speak for the community here at all though, it's just my opinion!
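
For readers wondering what such a helper might look like, here is a minimal lazy-sequence sketch. It is not kipz's implementation, which is not shown in this thread. `lazy-pages` is a hypothetical name, `fetch-page` stands in for a call such as `(partial scan creds :my-table)`, and the sketch assumes each result carries `:last-prim-kvs` in its metadata while more pages remain.

```clojure
(defn lazy-pages
  "Returns a lazy seq of pages. Each page is fetched only when the seq is
  realized that far; the seq ends after the first page whose metadata has
  no :last-prim-kvs."
  [fetch-page opts]
  (lazy-seq
    (let [page     (fetch-page opts)
          last-key (:last-prim-kvs (meta page))]
      (cons page
            (when last-key
              (lazy-pages fetch-page (assoc opts :last-prim-kvs last-key)))))))
```

A caller could then write `(doseq [page (lazy-pages fetch-page {:limit 100})] ...)` and keep the thread and retry policy outside the library.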

barkanido commented on August 18, 2024

@kipz fair enough. Maybe your example deserves a place here as something people can refer to, or even in the README. Just a thought. Anyway, I was just looking for a way to contribute and found this issue. Maybe a task from the TODO is of higher priority?

kipz commented on August 18, 2024

@joelittlejohn what are your thoughts on all this?

joelittlejohn commented on August 18, 2024

@barkanido Re your question about whether this ticket is a good one for a Faraday noob to tackle, it's probably not 🙂 The existing paging implementation is one of the most complex parts of Faraday and as @kipz mentions people have often found that they prefer to avoid the paging feature altogether and implement their own solution (over which they have more control) outside Faraday.

Is this a feature you need, or were you just interested in making a useful contribution? I think the most useful thing to be done for Faraday is better documentation. Better docstrings would help, and I think it would also be very useful to have a list of examples that show real-world usage covering all the typical ways to use Faraday's functions.

joelittlejohn commented on August 18, 2024

For a Faraday noob who wants to contribute something useful, I recommend using the library for a while on a few projects; over time you will inevitably uncover something you'd like that is missing.
