Giter Site home page Giter Site logo

Comments (8)

dwinston avatar dwinston commented on July 22, 2024 1

I recently added a get_criteria helper to support more robust incremental building, and I use it for XAS building in emmet by subclassing MapBuilder. Additional indexes are required, but builder runs should be more resilient to interruption. A builder with a get_items that uses get_criteria should be amenable to @mkhorton's suggestion for the runner to catch cursor timeout errors and re-call the builder's get_items. The basic idea is that last-updated filtering is done at the document (uniquely identified by key) rather than collection/store level -- this is why additional (compound) indexes are needed on the stores. @montoyjh I'd be particularly interested if this approach works for your command-cursor-based builder(s) because, as you mentioned, guarding against timeouts is more involved.

from maggma.

montoyjh avatar montoyjh commented on July 22, 2024

Yep this is occasionally an issue for me as well. It's also a bit of a pain if you use a command cursor (e. g. aggregation pipelines), because it's a bit harder to ensure that command cursors don't time out.

from maggma.

mkhorton avatar mkhorton commented on July 22, 2024

Not sure if an alternative option might be to just re-run get_items() ? If the builder has been written appropriately it should be harmless to just generate a new cursor.

from maggma.

montoyjh avatar montoyjh commented on July 22, 2024

Would that not break incremental builds? Presumably if documents have been added to a target collection with the current timestamp, the last_updated filters would filter out all of the documents included in the initial cursor upon re-execution. Also, it seems like most cursor timeouts would probably recur.

from maggma.

mkhorton avatar mkhorton commented on July 22, 2024

Yeah, I see that. It depends how the incremental builds are structured I guess -- but in this case it also means that any interrupted incremental build will mean that a complete re-build will have to be performed? What if a build in a cron job fails silently one day? This seems like a recipe for something bad to happen.

You could store the initial cursor query, but that feels a bit hacky (and wouldn't prevent the interrupted build issue either, just cursor timeouts).

from maggma.

montoyjh avatar montoyjh commented on July 22, 2024

Agreed, incremental builds from date filters have this general issue at present (this has been an issue for a long time). I've thought a bit about how to make this more robust, but haven't implemented anything yet. One thought I've had would be to build in a temporary staging collection and then copy to the target collection, which would presumably only fail if the copying step was interrupted (rather than the process_item->update targets pipeline). This puts a bit more of a space load on the database, though, and could still fail if the final copying step is interrupted.

I think there's probably a more effective strategy via some additional logic in the build bookkeeping or a validation step, perhaps all of the new documents in a collection have an "incremental_validation" field or something that only gets added collection-wide when an incremental build completes.

from maggma.

mkhorton avatar mkhorton commented on July 22, 2024

Well, MongoDB 4.0 supports multi-document ACID transactions, so in principle you could wrap all the update_targets into a single session/transaction. This is a very Mongo-specific thing however, so probably wouldn't play nicely with maggma's Stores abstraction(?)

I've thought about a secondary book-keeping collection too, which has the benefit of making it easy to display builder analytics, but that's another layer we'd have to implement.

I agree a staging collection could work, but is a bit unfortunate -- i.e. will the staging collection need to be cleaned/emptied automatically in the case of failed builds? What happens if someone specifies the wrong staging collection and accidentally wipes data? etc. I prefer the "incremental_validation" field idea, in that it's easiest to implement and quite lightweight, or perhaps a builder_in_progress field that gets set to True at the start and False when it completes gracefully... Still not a big fan though but might be the best we can do right now :/

from maggma.

shyamd avatar shyamd commented on July 22, 2024

We have to actively deal with this unfortunately by making sure our get_items doesn't stall between cursor calls. The best way is to either group by key, get data from a new cursor, and yield this data in get items. Not pretty, but pymongo doesn't give us much of a choice.

from maggma.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.