materialsproject / maggma
MongoDB aggregation machine
Home Page: https://materialsproject.github.io/maggma/
License: Other
I don't know if this is still an issue / was solved, but I recall folks like @montoyjh mentioning that builder runs mysteriously stalled for a long time without seemingly doing any process_item/update_targets work. I came across the following in an email list to which I subscribe, and I thought it might be applicable to our MultiprocProcessor:
So, how is multiprocessing.Pool broken? The way it works is to start subprocesses by using the POSIX (aka Unix) fork() syscall, which clones all the memory in the process. Typically you then call some variant of execv(), which replaces the copy with a completely new process, but not multiprocessing.Pool. So, when you start a subprocess like this it's a copy of everything in the parent Python process.
For example, any state stored in a Python module will be copied. I've seen child processes in the pool get really confused and try to rotate the parent process' logs, because the Python standard library logging module stores some state about log handling and that gets copied wholesale.
Also, any threads in the parent process are now missing in the forked copy. If your library starts some threads in the background your library will be broken in the subprocess (it’s obscure, but I maintain a library that actually does this). More commonly, thread locks might end up in a bad state, which is my guess for what happened in my case.
In short, the end result is a pool of processes that work some of the time—but sometimes break mysteriously. Python does support a process pool that doesn’t do this horrible fork()-only trick, but you need to know it’s there, and you need to know that the default is broken.
The key idea is to override the context's default "fork" start method. It's the default, but the docs state that "safely forking a multithreaded process is problematic." I know MultiprocProcessor starts a separate thread for updating targets and uses locks for synchronization, so there may be something screwy happening that can be avoided, e.g. by setting the start method to "forkserver" if supported on the host OS, and "spawn" otherwise.
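A minimal sketch of that suggestion, assuming the pool is built from an explicit multiprocessing context (the helper name is illustrative, not maggma code):

import multiprocessing

def safe_context():
    # Prefer "forkserver" where the OS supports it; fall back to "spawn".
    # Both avoid fork()'s wholesale copy of parent threads, locks, and
    # module state described above.
    if "forkserver" in multiprocessing.get_all_start_methods():
        return multiprocessing.get_context("forkserver")
    return multiprocessing.get_context("spawn")

# A Pool created from this context uses the safer start method:
with safe_context().Pool(processes=4) as pool:
    results = pool.map(abs, range(10))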
I was about to add a time limit to BVAnalyzer since @shyamd mentioned some materials were taking too long to complete. However, this seems like it may be a common issue that some analysis tasks just take too long (perhaps pathologically so) for certain materials. A more general solution might be to add a time limit (and time limit kwarg) to process_items, possibly just in MapBuilder itself.
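One possible shape for that, sketched as a hypothetical helper (not existing maggma API) that runs each analysis in a killable child process:

import multiprocessing

def _call(queue, func, item):
    queue.put(func(item))

def run_with_timeout(func, item, timeout):
    # Run func(item) in a child process so a pathologically slow analysis
    # can be terminated instead of stalling the whole builder run.
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_call, args=(queue, func, item))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()
        proc.join()
        return {"error": f"process_item exceeded {timeout} s"}
    return queue.get()

MapBuilder.process_item could then wrap the per-item work in run_with_timeout, with the timeout exposed as the kwarg suggested above.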
For some reason running a builder in jupyter causes duplicate messages to print
Add a summary warning when target documents are newer than the source, with the output triggered by a verbosity flag.
The behavior of Store.update should be modified to return the keys for the docs it was able to successfully update.
The boto package used by the S3Store to access MinIO generates a very large amount of DEBUG output, making it difficult to see the output from maggma itself.
Merge these two stores into one, as the second is just an init interface.
See example below for context. Not sure if this is intended behavior or a bug, since forcing a reconnect fixes the issue.
from maggma.stores.mongolike import MongoStore
store = MongoStore(<credentials/etc here>)
store.connect()
store.query_one({}) # This works fine and has no issues
store.close()
# Since MongoStore.close() does not clear the MongoStore._collection variable, calling connect() does nothing
store.connect()
# This query is fine if SSHTunnels are not in use, since the connection to the DB is
# automatically reopened by MongoClient. If SSHTunnel is used, the connection to
# the DB will not work, since the SSHTunnel connection was already closed
test_doc = store.query_one({})
# This works, as the connection to the SSHTunnel is re-established
store.connect(force_reset=True)
test_doc = store.query_one({})
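A minimal sketch of the implied fix, inferred from the snippet above rather than from maggma's actual internals (assuming connect() checks self._collection before reconnecting):

# Inside MongoStore (sketch):
def close(self):
    self._collection.database.client.close()
    # Clear the cached collection so a subsequent connect() actually
    # re-establishes the DB connection (and any SSHTunnel) instead of
    # being a silent no-op.
    self._collection = None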
In theory there are no major issues that should prevent this, but in practice there seem to be implementation problems with annotations. Need to figure out how to properly import futures in 3.6 to get the full compatibility necessary.
We need some manner of reporting from builders. This needs to be two-tiered: one set of reporting capabilities that comes from mrun, and one set that can be explicitly provided by the Builder.
mrun reporting:
Not sure what else to report.
Vault has done a lot to upgrade its capabilities since the vault store was first created. It would be nice to rebuild our VaultStore so that we can replace mongogrant, with all its fickleness and brittleness, with Vault.
Overall goals:
- A VaultStore class that encapsulates all the interaction with the Vault server, including potentially storing the Vault access token
- Stores (i.e. Atlas and MongoDB credentials) become a MongoStore and don't store authentication information
- Queries go through vault_store.query, not vault_store.connected_store.query, etc.
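A rough sketch of the delegation idea, assuming hvac for the Vault client; the secret layout, kwarg names, and method structure are all assumptions rather than a spec:

import hvac
from maggma.stores import MongoStore

class VaultStore:
    # Hypothetical sketch: pull MongoDB credentials from Vault and
    # delegate every Store method to an internal MongoStore.
    def __init__(self, vault_url, secret_path, token=None):
        client = hvac.Client(url=vault_url, token=token)
        secret = client.secrets.kv.read_secret_version(path=secret_path)
        self._store = MongoStore(**secret["data"]["data"])

    def __getattr__(self, attr):
        # vault_store.query(...) works directly, with no
        # vault_store.connected_store indirection, and no credentials
        # stored on the wrapper itself.
        return getattr(self._store, attr)

Another idea to think about: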
How do we make it so that builders can also define the ontology? Is this just the pydantic model, or can we have it define the rules for how something maps to another? That is, can we define the connections a builder is supposed to create in a standard ontology language without making this onerous?
Right now distributed running requires a hardcoded TCP IP + port. It would be nice if mrun could auto-find an open port in user-space to use, and tell the user if requested.
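The usual trick is to bind to port 0 and let the OS assign a free user-space port; a minimal sketch (with the caveat that the port could in principle be taken between this check and actual use):

import socket

def find_open_port():
    # Port 0 asks the OS for any available ephemeral port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]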
MSONable is great but has a couple flaws:
- no BaseModel for defining the data model
- hand-written as_dict and from_dict
We should start thinking about a new MSONable defined here, and then later, if desired, move it back into monty.
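Purely as a sketch of what the pydantic direction could look like; the class and fields are invented for illustration:

from pydantic import BaseModel

class Lattice(BaseModel):
    # Typed fields define the data model, and pydantic supplies
    # validation and serialization.
    a: float
    b: float
    c: float

    def as_dict(self):
        # MSON-style dict with @module/@class markers, layered on pydantic
        # instead of written by hand.
        return {
            "@module": type(self).__module__,
            "@class": type(self).__name__,
            **self.dict(),
        }

    @classmethod
    def from_dict(cls, d):
        return cls(**{k: v for k, v in d.items() if not k.startswith("@")})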
The MongoStore does not provide an insert option, only an update option. This does a bulk_write, but as a list of ReplaceOne due to how maggma deals with key handling, which presumably(?) is a lot slower than just a list of insert ops.
I'm building a large number of documents, but I've dropped down to using store.collection.insert_many() since the update seems prohibitive.
Is the lack of insert intentional? Has this been problematic for others?
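For reference, a sketch of the two write paths in raw pymongo terms (collection and key names are illustrative):

from pymongo import MongoClient, ReplaceOne

coll = MongoClient()["db"]["tasks"]
docs = [{"task_id": i, "value": i ** 2} for i in range(1000)]

# Roughly what Store.update issues: one keyed upsert per document.
coll.bulk_write([ReplaceOne({"task_id": d["task_id"]}, d, upsert=True) for d in docs])

# The plain insert path, much cheaper when the docs are known to be new.
coll.insert_many(docs)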
It's a little ambiguous how to set up the S3 store to have some subset of search data in the index. This is partly because the keys for creating the S3 object and the index keys are the same. We should make an init attribute in the S3Store, metadata or something, that the user supplies for fields they want to be able to search on. The S3Store should use the keys field to determine uniqueness when it looks for a doc in the index DB to update. This will make it so that the naming of the blob in S3 can just be something like an OID.
Considering attaching validators to the builder instead of the store, i.e. a doc_validator, initially only in MapBuilder, that would be called inside MapBuilder's process_item, with schema then added as a kwarg to the MapBuilder init; this should be more performant.
MapBuilder, GroupBuilder, and ProjectionBuilder all have this ensure-index warning that is not a good description of what actually went "wrong". It basically warns that one double index is missing, when in fact the check is for all relevant indexes: key, last_updated, and state; these are also not double indexes. Would be good to update the warning. Here is the linked code:
maggma/src/maggma/builders/map_builder.py
Lines 79 to 86 in 672c6ae
The current CLI throws ValueError: Set of coroutines/Futures is empty. when forgetting to pass in builders, either as a python file or as JSON. We should properly catch this and throw an error the user can act on.
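One possible shape of the fix, assuming the CLI validates its arguments with click (the helper and message are hypothetical):

import click

def check_builders(builders):
    # Fail early with an actionable message instead of letting asyncio
    # raise "ValueError: Set of coroutines/Futures is empty."
    if not builders:
        raise click.UsageError(
            "No builders found: pass a python file or JSON file that "
            "defines at least one builder."
        )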
The LU translator functions don't JSON-serialize, so the Store no longer dumps to JSON if you use one of them. I'll leave it up to you to decide how you want to deal with this. I'd think we might just have key-type strings and have the stores translate to a standard internal format?
I know that we've discussed this at least once before, but I think we should reconsider the argument order for criteria and properties for Store.query and Store.query_one. I think criteria should be first and properties should be second. As far as I can tell, the main reason we have properties first is because QueryEngine used to, and as far as I can tell no one who is building dbs for MP using maggma has much experience with or love for QueryEngine.
Basically, I want to be able to teach people how to use maggma by appealing to their knowledge of the MPRester query (or, if they've got experience with pymongo, pymongo.Collection.find) by saying "this is basically the same as those things", rather than "this is the same as those things, except the argument order is reversed to keep in line with some other thing we used a long time ago".
Is there any way we could build in a global progress bar for builders that also works well with logger/input outputs and doesn't require complex logic?
I believe it'll either error out (BulkWriteOp error) or lead to duplicate documents, depending on whether the object _id has been serialized or not.
Add the ability to include custom reporting from builders to the reporting store.
Using mrun as described in the docs can fail if the Builder is not serialized first.
If I instantiate a Builder inside a jupyter notebook and assign it to the __builder__ variable:
__builder__ = MaterialsBuilder(my_store, target_store, query={"tags":"my_favorite_tag"})
and then execute mrun test_builder.ipynb, I get
Traceback (most recent call last):
File "/Users/ryan/miniconda3/envs/mp2/bin/mrun", line 8, in <module>
sys.exit(run())
File "/Users/ryan/miniconda3/envs/mp2/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/Users/ryan/miniconda3/envs/mp2/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/Users/ryan/miniconda3/envs/mp2/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Users/ryan/miniconda3/envs/mp2/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/Users/ryan/miniconda3/envs/mp2/lib/python3.8/site-packages/maggma/cli/__init__.py", line 80, in run
asyncio.run(multi(builder, num_workers))
File "/Users/ryan/miniconda3/envs/mp2/lib/python3.8/asyncio/runners.py", line 43, in run
return loop.run_until_complete(main)
File "/Users/ryan/miniconda3/envs/mp2/lib/python3.8/asyncio/base_events.py", line 612, in run_until_complete
return future.result()
File "/Users/ryan/miniconda3/envs/mp2/lib/python3.8/site-packages/maggma/cli/multiprocessing.py", line 158, in multi
builder.connect()
AttributeError: 'dict' object has no attribute 'connect'
However, if I serialize the builder first:
dumpfn(__builder__, 'builder_test.json')
and then run mrun builder_test.json, all works as expected.
Expected behavior
mrun should work when called on a python file or jupyter notebook directly, even if the builder defined in that file has not been serialized.
Anybody else having this issue? Seems to happen if a set of process_items takes too long -- it's possible to pass the no_cursor_timeout=True kwarg to Mongolike.query, but I'm not sure if that should be done by default.
I'm not sure what the default timeout is, but we could also catch this exception with a recommendation to the user on how to increase the timeout...
Edit: Also, I'm not sure if the server will respect an increased timeout even if we set it in maggma.
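In raw pymongo terms, the per-query workaround looks like this; whether Mongolike.query should forward the kwarg by default is the open question:

from pymongo import MongoClient

coll = MongoClient()["db"]["tasks"]

# Stop the server from reaping the cursor while slow process_items run.
# The with-block matters: no_cursor_timeout cursors must be closed
# explicitly or they linger server-side.
with coll.find({}, no_cursor_timeout=True) as cursor:
    for doc in cursor:
        pass  # long-running per-document work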
This is a note for myself: mpi4py is broken on Cori... again. It seems to be a very fickle library, so it would be ideal to move to something like zeromq; auto-handshaking to determine the master/worker designations would be a much more robust method for multi-node scaling.
Need a MongoStore.from_pymongo collection
I appreciate they're kwargs anyway, but is there a reason their order is reversed compared to pymongo? Think this might easily confuse people. (Also, why query over find? For the same reason.)
There needs to be a Vault-enabled Mongo Store so that credentials can be acquired on the fly.
To avoid using the hidden attribute store._collection.aggregate. Alternatively, perhaps a pipeline kwarg on query instead.
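A sketch of the first option as a hypothetical store method (not current maggma API):

# Inside MongoStore (sketch):
def aggregate(self, pipeline, **kwargs):
    # Expose aggregation pipelines without callers reaching into the
    # hidden _collection attribute.
    return self._collection.aggregate(pipeline, **kwargs)

# Usage:
# store.aggregate([{"$match": {"state": "successful"}},
#                  {"$group": {"_id": "$formula", "n": {"$sum": 1}}}])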
Add MinIO to the GitHub Actions tests and run full integration tests against it.
maggma depends on pynng for a single import, Pair1:
maggma/src/maggma/cli/distributed.py
Lines 8 to 16 in 9e037d7
pynng causes pip install errors on M1 Macs (tracked in codypiersall/pynng#89). Also, see related issue hackingmaterials/atomate#728.
What's your take on maintaining the functionality of Pair1 as part of maggma itself? Otherwise, I would try to help fix the error upstream.
The SSH tunnel feature in MongoStore is really nice in a python terminal or jupyter notebook. Ideally, we could serialize the connection and have MongoStore initialize it when deserialized.
Should be able to run a builder directly using a builder.run() command under serial processing.
Another bare try/except!
Since I don't have much experience doing the building stuff, could someone explain:
- The drawbacks in the design of the current builder in pymatpro?
- What new features (other than MPI support and dependency-specification capabilities) must the new builder support? As far as I understand, the one in pymatpro implements the producer-consumer model using the multiprocessing package. It would help my understanding a lot if someone could give me a concrete, fully sketched-out example of a builder with complicated dependencies.
- What package dependencies are allowed? For example, the current get_db function in the helpers module is a reimplementation of the function in the matgendb.utils module. So should we use those from the matgendb package or make everything self-contained?
Thanks
The _collection property of GridFSStore does not work with confirm_field_index, which gets called when newer_in is used.
Default key should probably be _id, dealt with transparently the same way MongoDB does.
It would be nice to have a convenient jupyter-based runner to perform regular and multiprocessing runs.
Often we want to parallelize non-module Builders. It would be nice to enable this for at minimum serial and multiprocessing.
AttributeError: 'NoneType' object has no attribute 'find'
Doubly confusing because you're calling Store.query(), which is aliased to collection.find()...
For @shyamd
Currently, the semaphore for update_targets in the runner causes the entire program to wait for the whole update operation to complete. Decouple this by copying data from the deque and then starting a thread to update targets.
Should from tqdm import tqdm_notebook as tqdm when running inside Jupyter to avoid writing a billion progress bars. I'm not sure of the cleanest way to check if running inside Jupyter, however -- there seem to be a few options, none of them great.
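For reference, the most common of those options, a heuristic on the IPython shell class (admittedly fragile):

def in_jupyter():
    # Jupyter kernels expose get_ipython() with a ZMQ-backed shell;
    # plain terminals raise NameError or use a terminal shell class.
    try:
        return get_ipython().__class__.__name__ == "ZMQInteractiveShell"
    except NameError:
        return False

if in_jupyter():
    from tqdm import tqdm_notebook as tqdm
else:
    from tqdm import tqdm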
Ideally the Store abstracts away database-specific commands. get_criteria calls pymongo directly. It would be better if it called Store methods and the Stores used the correct calls internally.
There are a lot of key capabilities that need to be wrapped into some class, such as a query engine, that implements find-and-return capabilities agnostic of find, distinct, etc.
@montoyjh suggested this during my group presentation on Monday. Examples:
- Builder.get_items returns a large list rather than e.g. a cursor.
- Builder.update_items doesn't do a bulk operation, e.g. via Store.update.
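A sketch of both fixes on a toy builder; the import path and the source/target attribute names are assumptions following MapBuilder's conventions:

from maggma.core import Builder

class StreamingBuilder(Builder):
    def get_items(self):
        # Yield from the source cursor instead of materializing a list,
        # so memory stays flat for large collections.
        yield from self.source.query(criteria=self.query)

    def process_item(self, item):
        return item

    def update_targets(self, items):
        # One bulk Store.update per chunk, not a write per document.
        self.target.update(items)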
Since I took out the MPI implementation, I need to build a ZeroMQ-based multi-node Executor to run distributed processing.
Add a feature to preserve the last_updated data in MapBuilder.
This is useful for derived data collections where we lazily just recompute and update every time we run that builder, rather than only updating new things.