maggma's People

Contributors

acrutt, andrew-s-rosen, davidwaroquiers, dependabot-preview[bot], dependabot[bot], dwinston, e-kwsm, gilbertozp, gpetretto, jan-janssen, jmmshn, kbuma, lgtm-migrator, matk86, mjwen, mkhorton, ml-evs, montoyjh, munrojm, oooic, rkingsbury, shreddd, shyamd, sivonxay, tschaume, tsmathis, utf, wuxiaohua1011, yang-ruoxi

maggma's Issues

Possible solution to "stalled" Runner.run?

I don't know if this is still an issue / was solved, but I recall folks like @montoyjh mentioning that builder runs mysteriously stalled for a long time without seemingly doing any process_item/update_targets work. I came across the following in an email list to which I subscribe, and I thought it might be applicable to our MultiprocProcessor:

So, how is multiprocessing.Pool broken? The way it works is to start subprocesses by using the POSIX (aka Unix) fork() syscall, which clones all the memory in the process. Typically you then call some variant of execv(), which replaces the copy with a completely new process, but not multiprocessing.Pool. So, when you start a subprocess like this it’s a copy of everything in the parent Python process.

For example, any state stored in a Python module will be copied. I’ve seen child processes in the pool get really confused and try to rotate the parent process' logs, because the Python standard library logging module stores some state about log handling and that gets copied wholesale.

Also, any threads in the parent process are now missing in the forked copy. If your library starts some threads in the background your library will be broken in the subprocess (it’s obscure, but I maintain a library that actually does this). More commonly, thread locks might end up in a bad state, which is my guess for what happened in my case.

In short, the end result is a pool of processes that work some of the time—but sometimes break mysteriously. Python does support a process pool that doesn’t do this horrible fork()-only trick, but you need to know it’s there, and you need to know that the default is broken.

The key idea is to override the default "fork" start method; the docs state that "safely forking a multithreaded process is problematic." I know MultiprocProcessor starts a separate thread for updating targets and uses locks for synchronization, so there may be something screwy happening that can be avoided, e.g. by setting the start method to "forkserver" if supported on the host OS, and "spawn" otherwise.
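A minimal sketch of that idea, assuming we control where the Pool is created (the names and pool usage here are illustrative, not maggma's actual code):

import multiprocessing

def get_safe_context():
    # Prefer "forkserver" where supported; fall back to "spawn". Both
    # avoid cloning the parent's threads and locks the way bare fork()
    # does. get_context raises ValueError for unsupported methods.
    try:
        return multiprocessing.get_context("forkserver")
    except ValueError:
        return multiprocessing.get_context("spawn")

if __name__ == "__main__":  # required by spawn/forkserver
    ctx = get_safe_context()
    with ctx.Pool(processes=4) as pool:
        print(pool.map(str.upper, ["a", "b", "c"]))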

Add time limits to process_item? (Possibly just in MapBuilder?)

I was about to add a time limit to BVAnalyzer since @shyamd mentioned some materials were taking too long to complete. However, this seems like a common issue: some analysis tasks just take too long (perhaps pathologically so) for certain materials. A more general solution might be to add a time limit (and a time-limit kwarg) to process_item, possibly just in MapBuilder itself.
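A Unix-only sketch of such a limit using SIGALRM (the wrapper name and timeout kwarg are hypothetical, not existing maggma API):

import signal

def process_item_with_limit(func, item, timeout=60):
    # Interrupt func(item) with a TimeoutError if it exceeds the limit,
    # so one pathological material cannot stall the whole build.
    def _handler(signum, frame):
        raise TimeoutError(f"process_item exceeded {timeout}s")

    old_handler = signal.signal(signal.SIGALRM, _handler)
    signal.alarm(timeout)
    try:
        return func(item)
    except TimeoutError:
        return None  # or return an error doc recording the timeout
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)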

Boto DEBUG output floods mrun logs

The boto package used by the S3Store to access MinIO generates a very large amount of DEBUG output, making it difficult to see the output from maggma itself.
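A common workaround is to raise the log level on the relevant loggers before running; a minimal sketch (these are the standard boto3/botocore logger names, though the exact set may need adjusting):

import logging

# Silence the chatty AWS-related DEBUG output so maggma's own logs
# remain readable.
for name in ("boto3", "botocore", "s3transfer", "urllib3"):
    logging.getLogger(name).setLevel(logging.WARNING)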

Possible Issue with store.close() when using SSHTunnel

See example below for context. Not sure if this is intended behavior or a bug, since forcing a reconnect fixes the issue.

from maggma.stores.mongolike import MongoStore

store = MongoStore(<credentials/etc here>) 
store.connect()
store.query_one({}) # This works fine and has no issues
store.close()

# Since MongoStore.close() does not clear the MongoStore._collection
# variable, calling connect() does nothing.
store.connect()
# This query is fine if SSHTunnel is not in use, since MongoClient
# automatically reopens the connection to the DB. If an SSHTunnel is used,
# the query will not work, since the SSHTunnel connection was already closed.
test_doc = store.query_one({})

# This works, as the connection to the SSHTunnel is re-established
store.connect(force_reset=True) 
test_doc = store.query_one({}) 
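If this is indeed a bug, one possible fix is for close() to clear the cached collection so that a later connect() actually reconnects; a rough sketch, with attribute names guessed from the snippet above rather than taken from the real implementation:

def close(self):
    # Close the client, then clear the cached collection so that the
    # next connect() re-establishes both the client and the tunnel.
    self._collection.database.client.close()
    self._collection = None
    if self.ssh_tunnel is not None:  # hypothetical attribute name
        self.ssh_tunnel.stop()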

Python 3.6 compatibility

In theory there are no major issues that should prevent this, but in practice there seem to be implementation problems with annotations. We need to figure out how to properly import futures in 3.6 to get the full compatibility necessary.

Builder Reporting

We need some manner of reporting from builders. This needs to be two-tiered: one set of reporting capabilities that comes from mrun, and one set that can be explicitly provided by the Builder:

mrun reporting:

  • Start time
  • End time
  • Number of items
  • Total Number of Warnings
  • Total Number of Errors
  • Maybe Average time for each item?

Not sure what else to report.
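As a starting point, the mrun-level report could be a simple container for the fields listed above; a hypothetical sketch:

from dataclasses import dataclass
from datetime import datetime

@dataclass
class BuildReport:
    # Hypothetical container for the mrun-level statistics listed above.
    builder_name: str
    start_time: datetime
    end_time: datetime
    items_processed: int
    total_warnings: int
    total_errors: int
    avg_seconds_per_item: float  # (end - start) / items_processed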

Feature: Replace Mongogrant with Vault

Vault has upgraded its capabilities a lot since the VaultStore was first created. It would be nice to rebuild our VaultStore so that we can replace mongogrant, with all its fickleness and brittleness, with Vault.

Overall goals

  • A single VaultStore class that encapsulates all interaction with the Vault server, including potentially storing the Vault access token
  • Access Vault only on connect and create the corresponding Store (i.e., Atlas and MongoDB credentials become a MongoStore), and don't store authentication information
  • Access to the underlying store should be completely opaque, i.e. you still run vault_store.query, not vault_store.connected_store.query, etc.

Ontology generation from builder

Another idea to think about:
How do we make it so that builders can also define the ontology? Is this just the pydantic model, or can we have it define the rules for how one thing maps to another? That is, can we define the connections a builder is supposed to create in a standard ontology language without making this onerous?

Add port auto-negotiation

Right now distributed running requires a hardcoded TCP IP + port. It would be nice if mrun could auto-find an open port in user space and, if requested, tell the user which one it chose.
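Finding a free port in user space is straightforward with the standard library; a minimal sketch:

import socket

def find_free_port() -> int:
    # Bind to port 0 so the OS picks an unused port, then release it for
    # mrun to bind for real (a small race window remains between the two).
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]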

FEATURE: Upgraded MSONable

MSONable is great but has a couple of flaws:

  • support pydantic BaseModel for defining the data model
  • better handling of serializing components
  • Better encapsulation of serialization methods. This is particularly a problem where some classes completely break their as_dict and from_dict.

We should start thinking about a new MSONable defined here and then, later, if desired, move it back into monty.

Question regarding update vs insert

The MongoStore does not provide an insert option, only an update option. This does a bulk_write, but as a list of ReplaceOne operations due to how maggma deals with key handling, which presumably(?) is a lot slower than a plain list of insert ops.

I'm building a large number of documents, but I've dropped down to using store.collection.insert_many() since the update seems prohibitively slow.

Is the lack of insert intentional? Has this been problematic for others?
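For context, the difference in question, sketched directly against pymongo (the collection, the documents, and the assumption that the store key is task_id are all placeholders):

from pymongo import MongoClient, ReplaceOne

collection = MongoClient().db.tasks      # placeholder collection
docs = [{"task_id": 1, "energy": -1.5}]  # placeholder documents

# Roughly what Store.update does: upsert each doc, matched on the key.
collection.bulk_write(
    [ReplaceOne({"task_id": d["task_id"]}, d, upsert=True) for d in docs]
)

# A raw insert skips the key matching entirely, which is typically much
# faster but will happily create duplicates if a doc already exists.
collection.insert_many(docs)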

Separate out S3 Object reference keys from search keys

It's a little ambiguous how to set up the S3 store so that some subset of search data lives in the index. This is partly because the keys for creating the S3 object and the index keys are the same. We should add an init attribute to the S3Store, metadata or something, that the user supplies for fields they want to be able to search on (see the sketch below). The S3Store should use the key field to determine uniqueness when it looks for a doc in the index DB to update. This will make it so that the name of the blob in S3 can just be something like an OID.

@jmmshn @munrojm
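A hypothetical init signature for what is described above (searchable_fields is an invented name, and the real S3Store takes more arguments):

class S3Store:
    def __init__(self, index, bucket, key="fs_id", searchable_fields=None):
        self.index = index    # Store holding the queryable metadata
        self.bucket = bucket  # S3 bucket holding the blobs
        self.key = key        # sole determinant of uniqueness in the index
        # Proposed: extra fields copied into the index DB so they can be
        # searched without fetching the S3 object; the blob name itself
        # can then be an opaque OID.
        self.searchable_fields = searchable_fields or []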

Move validators for performance

Consider attaching validators to the builder instead of the store, i.e. a doc_validator that, initially only in MapBuilder, would be called inside MapBuilder's process_item; schema would then be added as a kwarg to the MapBuilder init. This should be more performant.

Bug: Bad Warning in Builders

MapBuilder, GroupBuilder, and ProjectionBuilder all emit an ensure-index warning that does not describe what actually went "wrong". It warns that one double index is missing when in fact the check covers all the relevant indexes (key, last_updated, and state), and these are not double indexes. It would be good to update the warning. Here is the linked code:

if not all(index_checks):
    self.logger.warning(
        "Missing one or more important indices on stores. "
        "Performance for large stores may be severely degraded. "
        "Ensure indices on target.key and "
        "[(store.last_updated_field, -1), (store.key, 1)] "
        "for each of source and target."
    )
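One possible rewording that matches what the check actually verifies, per the description above (a suggestion, not a committed fix):

if not all(index_checks):
    self.logger.warning(
        "Missing one or more of the recommended indexes (key, "
        "last_updated, state) on the source or target store. "
        "Performance for large stores may be severely degraded. "
        "Ensure indexes on store.key, store.last_updated_field, and "
        "the state field for both source and target."
    )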

LU translation functions don't serialize

The LU translator functions don't JSON-serialize, so the Store no longer dumps to JSON if you use one of them. I'll leave it up to you to decide how you want to deal with this. I'd think we might just have key-type strings and have the stores translate to a standard internal format?

query args order

I know that we've discussed this at least once before, but I think we should reconsider the argument order of criteria and properties for Store.query and Store.query_one: criteria should be first and properties second. As far as I can tell, the main reason we have properties first is that QueryEngine used to, and no one who is building DBs for MP using maggma has much experience with, or love for, QueryEngine.

Basically, I want to be able to teach people how to use maggma by appealing to their knowledge of the MPRester query (or if they've got experience with pymongo, pymongo.Collections.find) by saying "this is basically the same as those things", rather than "this is the same as those things, except the argument order is reversed to keep in line with some other thing we used a long time ago".

Progress Bar

Is there any way we could build in a global progress bar for builders that also works well with logger inputs/outputs and doesn't require complex logic?

mrun failure with 'dict' object has no attribute 'connect'

Using mrun as described in the docs can fail if the Builder is not serialized first.

If I instantiate a Builder inside a jupyter notebook and assign it to the __builder__ variable:

__builder__ = MaterialsBuilder(my_store, target_store, query={"tags":"my_favorite_tag"})

and then execute mrun test_builder.ipynb, I get

Traceback (most recent call last):
  File "/Users/ryan/miniconda3/envs/mp2/bin/mrun", line 8, in <module>
    sys.exit(run())
  File "/Users/ryan/miniconda3/envs/mp2/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Users/ryan/miniconda3/envs/mp2/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Users/ryan/miniconda3/envs/mp2/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/ryan/miniconda3/envs/mp2/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/ryan/miniconda3/envs/mp2/lib/python3.8/site-packages/maggma/cli/__init__.py", line 80, in run
    asyncio.run(multi(builder, num_workers))
  File "/Users/ryan/miniconda3/envs/mp2/lib/python3.8/asyncio/runners.py", line 43, in run
    return loop.run_until_complete(main)
  File "/Users/ryan/miniconda3/envs/mp2/lib/python3.8/asyncio/base_events.py", line 612, in run_until_complete
    return future.result()
  File "/Users/ryan/miniconda3/envs/mp2/lib/python3.8/site-packages/maggma/cli/multiprocessing.py", line 158, in multi
    builder.connect()
AttributeError: 'dict' object has no attribute 'connect'

However, if I serialize the builder first:

dumpfn(__builder__, 'builder_test.json')

and then run mrun builder_test.json, all works as expected.

Expected behavior

mrun should work when called on a python file or jupyter notebook directly, even if the builder defined in that file has not been serialized.

Cursor timeouts common

Anybody else having this issue? It seems to happen if a set of process_item calls takes too long -- it's possible to pass the no_cursor_timeout=True kwarg to Mongolike.query, but I'm not sure if that should be done by default.

I'm not sure what the default timeout is, but we could also catch this exception with a recommendation to the user on how to increase the timeout...

Edit: Also I'm not sure if the server will respect an increased timeout even if we set it in maggma.
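For reference, the kwarg mentioned above maps straight through to pymongo's find; a sketch with placeholder collection, criteria, and process:

# no_cursor_timeout=True asks the server not to reap this cursor after
# its idle timeout (10 minutes by default); the cursor must then be
# closed explicitly, which is why it may not be a safe global default.
cursor = collection.find(criteria, no_cursor_timeout=True)
try:
    for doc in cursor:
        process(doc)
finally:
    cursor.close()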

Move away from mpi4py

This is a note for myself: mpi4py is broken on Cori... again. It seems to be a very fickle library, so it would be ideal to move to something like ZeroMQ; auto-handshaking to determine the master/worker designations would be a much more robust method for multi-node scaling.

Criteria/properties order of MongoStore.query

I appreciate that they're kwargs anyway, but is there a reason their order is reversed compared to pymongo? I think this might easily confuse people. (Also, why query over find? Presumably for the same reason.)

Vault enabled Store

There needs to be a Vault-enabled Mongo Store so that credentials can be acquired on the fly.

Single import-dependence on pynng causes M1 Mac install error

maggma depends on pynng for a single import, Pair1:

from typing import List
from monty.json import jsanitize
from monty.serialization import MontyDecoder
from pynng import Pair1
from maggma.cli.multiprocessing import multi
from maggma.core import Builder
from maggma.utils import tqdm

pynng causes pip install errors on M1 Macs (tracked in codypiersall/pynng#89). Also, see related issue hackingmaterials/atomate#728.

What's your take on maintaining the functionality of Pair1 as part of maggma itself? Otherwise, I would try to help fix the error upstream.

FEATURE: Serialized SSH Tunnel

The SSH tunnel feature in MongoStore is really nice in a Python terminal or Jupyter notebook. Ideally, we could serialize the connection and have MongoStore initialize it when deserialized.

Seeking clarifications

Since I don't have much experience with the building stuff, could someone explain:

  • The drawbacks in the design of the current builder in pymatpro

  • What new features (other than MPI support and dependency-specification capabilities) must the new builder support? As far as I understand, the one in pymatpro implements the producer-consumer model using the multiprocessing package.

  • It would help my understanding a lot if someone could give me a concrete, fully sketched-out example of a builder with complicated dependencies.

  • What package dependencies are allowed? For example, the current get_db function in the helpers module is a reimplementation of the function in the matgendb.utils module. So should we use the ones from the matgendb package or make everything self-contained?

Thanks

Default key

The default key should probably be _id, dealt with transparently the same way MongoDB does.

FEATURE: Jupyter Commands

It would be nice to have a convenient Jupyter-based runner to perform regular and multiprocessing runs.

FEATURE: Python file runner

Often we want to parallelize non-module Builders. It would be nice to enable this for, at minimum, serial and multiprocessing execution.

Update while processing

For @shyamd

Currently, the semaphore for update_targets in the runner causes the entire program to wait for the whole update operation to complete. Decouple this by copying data out of the deque and then starting a thread to update the targets.
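A rough sketch of that decoupling (names are illustrative):

import threading
from collections import deque

def flush_targets(builder, queue: deque):
    # Copy the pending items out of the deque, then hand them to a
    # background thread so item processing is not blocked on the update.
    items = list(queue)
    queue.clear()
    t = threading.Thread(target=builder.update_targets, args=(items,))
    t.start()
    return t  # join() later to ensure all updates finish before exit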

tqdm in Jupyter

We should from tqdm import tqdm_notebook as tqdm when running inside Jupyter to avoid writing a billion progress bars. I'm not sure of the cleanest way to check whether we're running inside Jupyter, however -- there seem to be a few options, none of them great.
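One of the imperfect options alluded to above is the usual IPython-shell heuristic; a sketch:

def in_jupyter() -> bool:
    # The Jupyter kernel runs a ZMQInteractiveShell; a terminal IPython
    # session uses TerminalInteractiveShell, and plain Python has no
    # IPython at all (get_ipython() then returns None or is missing).
    try:
        from IPython import get_ipython
        return get_ipython().__class__.__name__ == "ZMQInteractiveShell"
    except ImportError:
        return False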

get_criteria directly invokes mongo commands

Ideally the Store abstracts away database-specific commands, but get_criteria invokes pymongo methods directly. It would be better if it called Store methods and the Stores used the correct calls internally.

Query Engine equivalent

There are a lot of key capabilities that need to be wrapped into some class, such as a query engine, that implements find-and-return capabilities agnostic of the underlying find, distinct, etc.

  • This is important to wrap all the different ways data can be accessed, e.g. direct Mongo or GridFS
  • Implement the same flat properties that pymatgen-db QueryEngine implements?
  • Implicit sandboxing
  • Provide structure-agnostic access to data
  • Eventual usage of aggregation pipelines across collections

Preserve last_updated for MapBuilder

Add a feature to preserve the last_updated data in MapBuilder.

This is useful for derived-data collections where we lazily just recompute and update every time we run that builder, rather than only updating new things.
