materialsproject / maggma
MongoDB aggregation machine
Home Page: https://materialsproject.github.io/maggma/
License: Other
I don't know if this is still an issue / was solved, but I recall folks like @montoyjh mentioning that builder runs mysteriously stalled for a long time without seemingly doing any process_item/update_targets work. I came across the following in an email list to which I subscribe, and I thought it might be applicable to our MultiprocProcessor:
So, how is multiprocessing.Pool broken? The way it works is to start subprocesses by using the POSIX (aka Unix) fork() syscall, which clones all the memory in the process. Typically you then call some variant of execv(), which replaces the copy with a completely new process, but not multiprocessing.Pool. So, when you start a subprocess like this it's a copy of everything in the parent Python process.
For example, any state stored in a Python module will be copied. I've seen child processes in the pool get really confused and try to rotate the parent process' logs, because the Python standard library logging module stores some state about log handling and that gets copied wholesale.
Also, any threads in the parent process are now missing in the forked copy. If your library starts some threads in the background your library will be broken in the subprocess (it’s obscure, but I maintain a library that actually does this). More commonly, thread locks might end up in a bad state, which is my guess for what happened in my case.
In short, the end result is a pool of processes that work some of the time—but sometimes break mysteriously. Python does support a process pool that doesn’t do this horrible fork()-only trick, but you need to know it’s there, and you need to know that the default is broken.
The key idea is to override the context's default "fork" start method. It's the default, but the docs state that "safely forking a multithreaded process is problematic." I know MultiprocProcessor starts a separate thread for updating targets and uses locks for synchronization, so there may be something screwy happening that can be avoided, e.g. by setting the start method to "forkserver" if supported on the host OS, and "spawn" otherwise.
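A minimal sketch of that suggestion, assuming the pool is built from an explicit multiprocessing context (the helper name is illustrative, not maggma code):

import multiprocessing

def safe_context():
    # Prefer "forkserver" where the OS supports it; fall back to "spawn".
    # Both avoid fork()'s wholesale copy of parent threads, locks, and
    # module state described above.
    if "forkserver" in multiprocessing.get_all_start_methods():
        return multiprocessing.get_context("forkserver")
    return multiprocessing.get_context("spawn")

# A Pool created from this context uses the safer start method:
with safe_context().Pool(processes=4) as pool:
    results = pool.map(abs, range(10))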
I was about to add a time limit to BVAnalyzer since @shyamd mentioned some materials were taking too long to complete. However, this seems like it may be a common issue that some analysis tasks just take too long (perhaps pathologically so) for certain materials. A more general solution might be to add a time limit (and time limit kwarg) to process_items, possibly just in MapBuilder itself.
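One possible shape for that, sketched as a hypothetical helper (not existing maggma API) that runs each analysis in a killable child process:

import multiprocessing

def _call(queue, func, item):
    queue.put(func(item))

def run_with_timeout(func, item, timeout):
    # Run func(item) in a child process so a pathologically slow analysis
    # can be terminated instead of stalling the whole builder run.
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_call, args=(queue, func, item))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()
        proc.join()
        return {"error": f"process_item exceeded {timeout} s"}
    return queue.get()

MapBuilder.process_item could then wrap the per-item work in run_with_timeout, with the timeout exposed as the kwarg suggested above.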
For some reason running a builder in jupyter causes duplicate messages to print
Add a summary warning when target documents are newer than the source, with the output triggered by a verbosity flag.
The behavior of Store.update should be modified to return the keys for the docs it was able to successfully update.
The boto package used by the S3Store to access MinIO generates a very large amount of DEBUG output, making it difficult to see the output from maggma itself.
Merge these two stores into one, as the second is just an init interface.
See example below for context. Not sure if this is intended behavior or a bug, since forcing a reconnect fixes the issue.
from maggma.stores.mongolike import MongoStore
store = MongoStore(<credentials/etc here>)
store.connect()
store.query_one({}) # This works fine and has no issues
store.close()
# Since MongoStore.close() does not clear the MongoStore._collection variable, calling connect() does nothing
store.connect()
# This query is fine if SSHTunnels are not in use, since the connection to the DB is
# automatically reopened by MongoClient. If SSHTunnel is used, the connection to
# the DB will not work, since the SSHTunnel connection was already closed
test_doc = store.query_one({})
# This works, as the connection to the SSHTunnel is re-established
store.connect(force_reset=True)
test_doc = store.query_one({})
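A minimal sketch of the implied fix, inferred from the snippet above rather than from maggma's actual internals (assuming connect() checks self._collection before reconnecting):

# Inside MongoStore (sketch):
def close(self):
    self._collection.database.client.close()
    # Clear the cached collection so a subsequent connect() actually
    # re-establishes the DB connection (and any SSHTunnel) instead of
    # being a silent no-op.
    self._collection = None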
In theory there are no major issues that should prevent this, but in practice there seem to be implementation problems with annotations. Need to figure out how to properly import futures in 3.6 to get the full compatibility necessary.
We need some manner of reporting from builders. This needs to be two-tiered: one set of reporting capabilities that comes from mrun, and one set that can be explicitly provided by the Builder.
mrun reporting:
Not sure what else to report.
Vault has done a lot to upgrade its capabilities since the vault store was first created. It would be nice to rebuild our VaultStore so that we can replace mongogrant, with all its fickleness and brittleness, with Vault.
Overall goals:
- A VaultStore class that encapsulates all the interaction with the Vault server, including potentially storing the Vault access token
- Stores (i.e. Atlas and MongoDB credentials) become a MongoStore and don't store authentication information
- Queries go through vault_store.query, not vault_store.connected_store.query, etc.
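A rough sketch of the delegation idea, assuming hvac for the Vault client; the secret layout, kwarg names, and method structure are all assumptions rather than a spec:

import hvac
from maggma.stores import MongoStore

class VaultStore:
    # Hypothetical sketch: pull MongoDB credentials from Vault and
    # delegate every Store method to an internal MongoStore.
    def __init__(self, vault_url, secret_path, token=None):
        client = hvac.Client(url=vault_url, token=token)
        secret = client.secrets.kv.read_secret_version(path=secret_path)
        self._store = MongoStore(**secret["data"]["data"])

    def __getattr__(self, attr):
        # vault_store.query(...) works directly, with no
        # vault_store.connected_store indirection, and no credentials
        # stored on the wrapper itself.
        return getattr(self._store, attr)

Another idea to think about: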
How do we make it so that builders can also define the ontology? Is this just the pydantic model, or can we have it define the rules for how something maps to another? That is, can we define the connections a builder is supposed to create in a standard ontology language without making this onerous?
Right now distributed running requires a hardcoded TCP IP + port. It would be nice if mrun could auto-find an open port in user-space to use, and tell the user if requested.
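The usual trick is to bind to port 0 and let the OS assign a free user-space port; a minimal sketch (with the caveat that the port could in principle be taken between this check and actual use):

import socket

def find_open_port():
    # Port 0 asks the OS for any available ephemeral port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]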
MSONable is great but has a couple flaws:
- no BaseModel for defining the data model
- hand-written as_dict and from_dict
We should start thinking about a new MSONable defined here, and then later, if desired, move it back into monty.
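Purely as a sketch of what the pydantic direction could look like; the class and fields are invented for illustration:

from pydantic import BaseModel

class Lattice(BaseModel):
    # Typed fields define the data model, and pydantic supplies
    # validation and serialization.
    a: float
    b: float
    c: float

    def as_dict(self):
        # MSON-style dict with @module/@class markers, layered on pydantic
        # instead of written by hand.
        return {
            "@module": type(self).__module__,
            "@class": type(self).__name__,
            **self.dict(),
        }

    @classmethod
    def from_dict(cls, d):
        return cls(**{k: v for k, v in d.items() if not k.startswith("@")})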
The MongoStore does not provide an insert option, only an update option. This does a bulk_write, but as a list of ReplaceOne due to how maggma deals with key handling, which presumably(?) is a lot slower than just a list of insert ops.
I'm building a large number of documents, but I've dropped down to using store.collection.insert_many() since the update seems prohibitive.
Is the lack of insert intentional? Has this been problematic for others?
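For reference, a sketch of the two write paths in raw pymongo terms (collection and key names are illustrative):

from pymongo import MongoClient, ReplaceOne

coll = MongoClient()["db"]["tasks"]
docs = [{"task_id": i, "value": i ** 2} for i in range(1000)]

# Roughly what Store.update issues: one keyed upsert per document.
coll.bulk_write([ReplaceOne({"task_id": d["task_id"]}, d, upsert=True) for d in docs])

# The plain insert path, much cheaper when the docs are known to be new.
coll.insert_many(docs)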
It's a little ambiguous how to set up the S3 store to have some subset of search data in the index. This is partly because the keys for creating the S3 object and the index keys are the same. We should make an init attribute in the S3Store, metadata or something, that the user supplies for fields they want to be able to search on. The S3Store should use the keys field to determine uniqueness when it looks for a doc in the index DB to update. This will make it so that the naming of the blob in S3 can just be something like an OID.
Considering attaching validators to the builder instead of the store, i.e. a doc_validator, initially only in MapBuilder, that would be called inside MapBuilder's process_item, with schema then added as a kwarg to the MapBuilder init; this should be more performant.
MapBuilder, GroupBuilder, and ProjectionBuilder all have this ensure-index warning that is not a good description of what actually went "wrong". It basically warns that one double index is missing, when in fact the check is for all relevant indexes: key, last_updated, and state; these are also not double indexes. Would be good to update the warning. Here is the linked code:
maggma/src/maggma/builders/map_builder.py
Lines 79 to 86 in 672c6ae
The current CLI throws ValueError: Set of coroutines/Futures is empty. when forgetting to pass in builders, either as a python file or as JSON. We should properly catch this and throw an error the user can act on.
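One possible shape of the fix, assuming the CLI validates its arguments with click (the helper and message are hypothetical):

import click

def check_builders(builders):
    # Fail early with an actionable message instead of letting asyncio
    # raise "ValueError: Set of coroutines/Futures is empty."
    if not builders:
        raise click.UsageError(
            "No builders found: pass a python file or JSON file that "
            "defines at least one builder."
        )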
The LU translator functions don't JSON-serialize, so the Store no longer dumps to JSON if you use one of them. I'll leave it up to you to decide how you want to deal with this. I'd think we might just have key-type strings and have the stores translate to a standard internal format?
I know that we've discussed this at least once before, but I think we should reconsider the argument order for criteria and properties for Store.query and Store.query_one. I think criteria should be first and properties should be second. As far as I can tell, the main reason we have properties first is because QueryEngine used to, and as far as I can tell no one who is building dbs for MP using maggma has much experience with or love for QueryEngine.
Basically, I want to be able to teach people how to use maggma by appealing to their knowledge of the MPRester query (or, if they've got experience with pymongo, pymongo.Collection.find) by saying "this is basically the same as those things", rather than "this is the same as those things, except the argument order is reversed to keep in line with some other thing we used a long time ago".
Is there any way we could build in a global progress bar for builders that also works well with logger/input outputs and doesn't require complex logic?
I believe it'll either error out (BulkWriteOp error) or lead to duplicate documents, depending on whether the object _id has been serialized or not.
Add the ability to include custom reporting from builders to the reporting store.
Using mrun as described in the docs can fail if the Builder is not serialized first.
If I instantiate a Builder inside a jupyter notebook and assign it to the __builder__ variable:
__builder__ = MaterialsBuilder(my_store, target_store, query={"tags":"my_favorite_tag"})
and then execute mrun test_builder.ipynb, I get
Traceback (most recent call last):
File "/Users/ryan/miniconda3/envs/mp2/bin/mrun", line 8, in <module>
sys.exit(run())
File "/Users/ryan/miniconda3/envs/mp2/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/Users/ryan/miniconda3/envs/mp2/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/Users/ryan/miniconda3/envs/mp2/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Users/ryan/miniconda3/envs/mp2/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/Users/ryan/miniconda3/envs/mp2/lib/python3.8/site-packages/maggma/cli/__init__.py", line 80, in run
asyncio.run(multi(builder, num_workers))
File "/Users/ryan/miniconda3/envs/mp2/lib/python3.8/asyncio/runners.py", line 43, in run
return loop.run_until_complete(main)
File "/Users/ryan/miniconda3/envs/mp2/lib/python3.8/asyncio/base_events.py", line 612, in run_until_complete
return future.result()
File "/Users/ryan/miniconda3/envs/mp2/lib/python3.8/site-packages/maggma/cli/multiprocessing.py", line 158, in multi
builder.connect()
AttributeError: 'dict' object has no attribute 'connect'
However, if I serialize the builder first:
dumpfn(__builder__, 'builder_test.json')
and then run mrun builder_test.json, all works as expected.
Expected behavior
mrun should work when called on a python file or jupyter notebook directly, even if the builder defined in that file has not been serialized.
Anybody else having this issue? Seems to happen if a set of process_items takes too long -- it's possible to pass the no_cursor_timeout=True kwarg to Mongolike.query, but I'm not sure if that should be done by default.
I'm not sure what the default timeout is, but we could also catch this exception with a recommendation to the user on how to increase the timeout...
Edit: Also, I'm not sure if the server will respect an increased timeout even if we set it in maggma.
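In raw pymongo terms, the per-query workaround looks like this; whether Mongolike.query should forward the kwarg by default is the open question:

from pymongo import MongoClient

coll = MongoClient()["db"]["tasks"]

# Stop the server from reaping the cursor while slow process_items run.
# The with-block matters: no_cursor_timeout cursors must be closed
# explicitly or they linger server-side.
with coll.find({}, no_cursor_timeout=True) as cursor:
    for doc in cursor:
        pass  # long-running per-document work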
This is a note for myself: mpi4py is broken on Cori... again. It seems to be a very fickle library, so it would be ideal to move to something like zeromq; auto-handshaking to determine the master/worker designations would be a much more robust method for multi-node scaling.
Need a MongoStore.from_pymongo collection
I appreciate they're kwargs anyway, but is there a reason their order is reversed compared to pymongo? Think this might easily confuse people. (Also, why query over find? For the same reason.)
There needs to be a Vault-enabled Mongo Store so that credentials can be acquired on the fly.
To avoid using the hidden attribute store._collection.aggregate. Alternatively, perhaps a pipeline kwarg on query instead.
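A sketch of the first option as a hypothetical store method (not current maggma API):

# Inside MongoStore (sketch):
def aggregate(self, pipeline, **kwargs):
    # Expose aggregation pipelines without callers reaching into the
    # hidden _collection attribute.
    return self._collection.aggregate(pipeline, **kwargs)

# Usage:
# store.aggregate([{"$match": {"state": "successful"}},
#                  {"$group": {"_id": "$formula", "n": {"$sum": 1}}}])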
Add MinIO to the GitHub Actions tests and run full integration tests against it.
maggma depends on pynng for a single import, Pair1:
maggma/src/maggma/cli/distributed.py
Lines 8 to 16 in 9e037d7
pynng causes pip install errors on M1 Macs (tracked in codypiersall/pynng#89). Also, see related issue hackingmaterials/atomate#728.
What's your take on maintaining the functionality of Pair1 as part of maggma itself? Otherwise, I would try to help fix the error upstream.
The SSH tunnel feature in MongoStore is really nice in a python terminal or jupyter notebook. Ideally, we could serialize the connection and have MongoStore initialize it when deserialized.
Should be able to run a builder directly using a builder.run() command under serial processing.
Another bare try/except!
Since I don't have much experience doing the building stuff, could someone explain:
- The drawbacks in the design of the current builder in pymatpro?
- What new features (other than MPI support and dependency-specification capabilities) must the new builder support? As far as I understand, the one in pymatpro implements the producer-consumer model using the multiprocessing package. It would help my understanding a lot if someone could give me a concrete, fully sketched-out example of a builder with complicated dependencies.
- What package dependencies are allowed? For example, the current get_db function in the helpers module is a reimplementation of the function in the matgendb.utils module. So should we use those from the matgendb package or make everything self-contained?
Thanks
The _collection property of GridFSStore does not work with confirm_field_index, which gets called when newer_in is used.
Default key should probably be _id, dealt with transparently the same way MongoDB does.
It would be nice to have a convenient jupyter-based runner to perform regular and multiprocessing runs.
Often we want to parallelize non-module Builders. It would be nice to enable this for at minimum serial and multiprocessing.
AttributeError: 'NoneType' object has no attribute 'find'
Doubly confusing because you're calling Store.query(), which is aliased to collection.find()...
For @shyamd
Currently, the semaphore for update_targets in the runner causes the entire program to wait for the whole update operation to complete. Decouple this by copying data from the deque and then starting a thread to update targets.
Should from tqdm import tqdm_notebook as tqdm when running inside Jupyter to avoid writing a billion progress bars. I'm not sure of the cleanest way to check if running inside Jupyter, however -- there seem to be a few options, none of them great.
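For reference, the most common of those options, a heuristic on the IPython shell class (admittedly fragile):

def in_jupyter():
    # Jupyter kernels expose get_ipython() with a ZMQ-backed shell;
    # plain terminals raise NameError or use a terminal shell class.
    try:
        return get_ipython().__class__.__name__ == "ZMQInteractiveShell"
    except NameError:
        return False

if in_jupyter():
    from tqdm import tqdm_notebook as tqdm
else:
    from tqdm import tqdm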
Ideally the Store abstracts away database-specific commands. get_criteria calls pymongo directly. It would be better if it called Store methods and the Stores used the correct calls internally.
There are a lot of key capabilities that need to be wrapped into some class, such as a query engine, that implements find-and-return capabilities agnostic of find, distinct, etc.
@montoyjh suggested this during my group presentation on Monday. Examples:
- Builder.get_items returns a large list rather than e.g. a cursor.
- Builder.update_items doesn't do a bulk operation, e.g. via Store.update.
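A sketch of both fixes on a toy builder; the import path and the source/target attribute names are assumptions following MapBuilder's conventions:

from maggma.core import Builder

class StreamingBuilder(Builder):
    def get_items(self):
        # Yield from the source cursor instead of materializing a list,
        # so memory stays flat for large collections.
        yield from self.source.query(criteria=self.query)

    def process_item(self, item):
        return item

    def update_targets(self, items):
        # One bulk Store.update per chunk, not a write per document.
        self.target.update(items)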
Since I took out the MPI implementation, I need to build a ZeroMQ-based multi-node Executor to run distributed processing.
Add a feature to preserve the last_updated data in MapBuilder.
This is useful for derived data collections where we lazily just recompute and update every time we run that builder, rather than only updating new things.