Comments (8)
I recently added a get_criteria helper to support more robust incremental building, and I use it for XAS building in emmet by subclassing MapBuilder. Additional indexes are required, but builder runs should be more resilient to interruption. A builder with a get_items that uses get_criteria should be amenable to @mkhorton's suggestion for the runner to catch cursor timeout errors and re-call the builder's get_items. The basic idea is that last-updated filtering is done at the document (uniquely identified by key) rather than collection/store level -- this is why additional (compound) indexes are needed on the stores. @montoyjh I'd be particularly interested if this approach works for your command-cursor-based builder(s) because, as you mentioned, guarding against timeouts is more involved.
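
To illustrate the per-document idea, here is a minimal sketch that uses plain Python dicts in place of stores. The function name keys_needing_update and the field names task_id and last_updated are assumptions for illustration only; this is not the actual get_criteria implementation.

```python
from datetime import datetime


def keys_needing_update(source_docs, target_docs, key="task_id", lu_field="last_updated"):
    """Return keys whose source document is newer than its target document,
    or that have no target document yet (hypothetical helper)."""
    # Map each already-built key to the timestamp of its target document.
    target_lu = {d[key]: d[lu_field] for d in target_docs}
    stale = []
    for doc in source_docs:
        built = target_lu.get(doc[key])
        # Compare per key, not against one collection-wide timestamp.
        if built is None or doc[lu_field] > built:
            stale.append(doc[key])
    return stale


source = [
    {"task_id": "a", "last_updated": datetime(2019, 1, 3)},
    {"task_id": "b", "last_updated": datetime(2019, 1, 1)},
    {"task_id": "c", "last_updated": datetime(2019, 1, 2)},
]
target = [
    {"task_id": "a", "last_updated": datetime(2019, 1, 2)},
    {"task_id": "b", "last_updated": datetime(2019, 1, 2)},
]
stale_keys = keys_needing_update(source, target)
```

Because the comparison is made key by key, a run that is interrupted partway leaves the unbuilt keys selected on the next run, which a single collection-wide last-updated cutoff would not guarantee.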
from maggma.
Yep, this is occasionally an issue for me as well. It's also a bit of a pain if you use a command cursor (e.g. aggregation pipelines), because it's harder to ensure that command cursors don't time out.
from maggma.
Not sure if an alternative option might be to just re-run get_items()? If the builder has been written appropriately it should be harmless to just generate a new cursor.
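
A rough sketch of what re-running could look like, assuming get_items is a generator that can safely be called again. CursorTimeout is a stand-in exception defined here so the example runs on its own (in practice the driver error would be something like pymongo.errors.CursorNotFound), and items_with_restart and key_of are hypothetical names, not part of maggma.

```python
class CursorTimeout(Exception):
    """Stand-in for a driver error such as pymongo.errors.CursorNotFound."""


def items_with_restart(get_items, key_of, max_restarts=3):
    """Yield items from get_items(), calling it again if the cursor dies.

    Keys that were already yielded are skipped after a restart.
    """
    seen = set()
    restarts = 0
    while True:
        try:
            for item in get_items():
                k = key_of(item)
                if k in seen:
                    continue  # already handed out before the restart
                seen.add(k)
                yield item
            return  # cursor exhausted without error
        except CursorTimeout:
            restarts += 1
            if restarts > max_restarts:
                raise


calls = {"n": 0}


def flaky_get_items():
    """Fails partway through on the first call only, to simulate a timeout."""
    calls["n"] += 1
    for i in range(5):
        if calls["n"] == 1 and i == 2:
            raise CursorTimeout("cursor expired")
        yield {"task_id": i}


result = [d["task_id"] for d in items_with_restart(flaky_get_items, lambda d: d["task_id"])]
```

This only helps if the second call to get_items selects the same documents as the first, so it depends on how the builder constructs its query.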
from maggma.
Would that not break incremental builds? Presumably if documents have been added to a target collection with the current timestamp, the last_updated filters would filter out all of the documents included in the initial cursor upon re-execution. Also, it seems like most cursor timeouts would probably recur.
from maggma.
Yeah, I see that. It depends how the incremental builds are structured I guess -- but in this case it also means that any interrupted incremental build will mean that a complete re-build will have to be performed? What if a build in a cron job fails silently one day? This seems like a recipe for something bad to happen.
You could store the initial cursor query, but that feels a bit hacky (and wouldn't prevent the interrupted build issue either, just cursor timeouts).
from maggma.
Agreed, incremental builds from date filters have this general issue at present (this has been an issue for a long time). I've thought a bit about how to make this more robust, but haven't implemented anything yet. One thought I've had would be to build in a temporary staging collection and then copy to the target collection, which would presumably only fail if the copying step was interrupted (rather than the process_item->update targets pipeline). This puts a bit more of a space load on the database, though, and could still fail if the final copying step is interrupted.
I think there's probably a more effective strategy via some additional logic in the build bookkeeping or a validation step. Perhaps all of the new documents in a collection could have an "incremental_validation" field or something similar that only gets added collection-wide when an incremental build completes.
from maggma.
Well, MongoDB 4.0 supports multi-document ACID transactions, so in principle you could wrap all the update_targets into a single session/transaction. This is a very Mongo-specific thing however, so probably wouldn't play nicely with maggma's Stores abstraction(?)
I've thought about a secondary book-keeping collection too, which has the benefit of making it easy to display builder analytics, but that's another layer we'd have to implement.
I agree a staging collection could work, but is a bit unfortunate -- i.e. will the staging collection need to be cleaned/emptied automatically in the case of failed builds? What happens if someone specifies the wrong staging collection and accidentally wipes data? etc. I prefer the "incremental_validation" field idea, in that it's easiest to implement and quite lightweight, or perhaps a builder_in_progress field that gets set to True at the start and False when it completes gracefully... I'm still not a big fan, but it might be the best we can do right now :/
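
A minimal sketch of the builder_in_progress idea, with a plain dict standing in for whatever bookkeeping collection or field would actually hold the flag. The build_in_progress context manager and the bookkeeping layout are assumptions for illustration.

```python
from contextlib import contextmanager


@contextmanager
def build_in_progress(bookkeeping, builder_name):
    """Set builder_in_progress to True on entry; reset it to False only
    if the wrapped build finishes without raising."""
    bookkeeping[builder_name] = {"builder_in_progress": True}
    yield
    # Reached only when the body raised no exception.
    bookkeeping[builder_name]["builder_in_progress"] = False


bookkeeping = {}

# A build that completes gracefully.
with build_in_progress(bookkeeping, "good_builder"):
    pass

# A build that is interrupted partway through.
try:
    with build_in_progress(bookkeeping, "bad_builder"):
        raise RuntimeError("interrupted")
except RuntimeError:
    pass
```

A later run could check the flag and, if it is still True, fall back to a full rebuild or a per-document comparison instead of trusting the date filter.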
from maggma.
Unfortunately, we have to actively deal with this by making sure our get_items doesn't stall between cursor calls. The best way is to group by key, get the data for each group from a new cursor, and yield this data in get_items. Not pretty, but pymongo doesn't give us much of a choice.
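
One way to read this, sketched with an in-memory list in place of a real collection. The names get_items_in_groups, chunked and fake_query are hypothetical, and grouping keys into fixed-size chunks is an assumption about what "group by key" means here. Each group gets its own short-lived query whose results are fully read before anything is yielded, so no cursor sits idle while items are processed.

```python
def chunked(seq, size):
    """Split a list into consecutive chunks of at most `size` elements."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]


def get_items_in_groups(distinct_keys, query, key="task_id", chunk_size=2):
    """Yield documents group by group.

    `query(criteria)` is a hypothetical callable that opens a fresh cursor
    (any iterable) for the given criteria each time it is called.
    """
    for group in chunked(list(distinct_keys), chunk_size):
        # Exhaust the cursor right away so it cannot time out between yields.
        docs = list(query({key: {"$in": group}}))
        for doc in docs:
            yield doc


collection = [{"task_id": i, "value": i * i} for i in range(5)]
queries = []


def fake_query(criteria):
    """In-memory stand-in for a find() call with an $in filter."""
    queries.append(criteria)
    wanted = set(criteria["task_id"]["$in"])
    return (d for d in collection if d["task_id"] in wanted)


items = list(get_items_in_groups([d["task_id"] for d in collection], fake_query))
```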
from maggma.