ottogroup / palladium

Framework for setting up predictive analytics services

License: Apache License 2.0

Shell 1.95% Python 97.37% R 0.37% Dockerfile 0.30%
data-science machine-learning scikit-learn

palladium's People

Contributors

alattner, benjamin-work, dnouri, grzr, hgraubit, m-jain-1, ottogroup-com, ottonemo, sayreblades, scieneers-jw, yversley-ottogroup


palladium's Issues

[Feature request] Add a possibility to persist artifacts besides the model itself

At the moment, only the model can be persisted and loaded. However, there are scenarios that necessitate saving and loading additional data.

E.g., assume we have a regression problem. We want to normalize the targets to a certain range during training, but when the predict service is called, predictions should be mapped back to the original range. Touching the targets is not part of an sklearn pipeline, so we may do it during data loading. However, when we start the prediction service, we need access to the mapping. Currently, we would have to load the data again to generate the mapping, or try to save the mapping as an attribute of the model.

Ideally, we would be able to simply save and load the mapping using Palladium tools. The solution should not be too specific to the example above, but should address persisting additional artifacts in general.
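Until such a tool exists, a minimal sketch of the attribute-based workaround mentioned above might look as follows; the TargetScalingModel wrapper is hypothetical and only illustrates storing the mapping on the model so that it travels with the pickle:

from sklearn.base import BaseEstimator

class TargetScalingModel(BaseEstimator):
    """Wrap an estimator and remember the target range seen during fit."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        # Remember the original target range so that predictions can be
        # mapped back when the prediction service is running.
        self.y_min_, self.y_max_ = y.min(), y.max()
        y_scaled = (y - self.y_min_) / (self.y_max_ - self.y_min_)
        self.estimator.fit(X, y_scaled)
        return self

    def predict(self, X):
        y_scaled = self.estimator.predict(X)
        # Map predictions back to the original target range.
        return y_scaled * (self.y_max_ - self.y_min_) + self.y_min_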

Flexible handling of (multiple) entry points in Palladium's config

Right now there are only the /alive entry point and a single /predict entry point, and there is no clean way to specify an individual entry point, let alone multiple ones. We've had cases where we wanted more than one /predict variant in one Palladium instance and implemented it by cloning the predict function and assigning it a new entry point.

The solution should allow for specifying multiple predict service instances (and entry point names).
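For illustration, a hypothetical config shape for this feature could look like the following; the predict_services key and the per-service entry point names do not exist in Palladium today and are purely a suggestion:

{
    # Hypothetical: map entry point names to predict service instances.
    'predict_services': {
        '/predict/price': {
            '__factory__': 'palladium.server.PredictService',
            'mapping': [('sqm', 'float')],
        },
        '/predict/demand': {
            '__factory__': 'palladium.server.PredictService',
            'mapping': [('week', 'int')],
        },
    },
}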

How do I use Palladium with a pretrained skorch/pytorch model?

I already have a skorch/pytorch model pickled into a file. How do I use it with the palladium dev-server, without having to retrain or prototype?
Also, does it support handling multiple requests in parallel?
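A minimal sketch of a config that serves an existing pickle without retraining, assuming the file is renamed to match the File persister's '{version}' placeholder (e.g. model-1.pickle); the paths and the mapping below are placeholders:

{
    'model_persister': {
        '__factory__': 'palladium.persistence.File',
        # The File persister resolves '{version}'; the pretrained pickle
        # would need to be stored as e.g. /srv/models/model-1.pickle.
        'path': '/srv/models/model-{version}.pickle',
    },
    'predict_service': {
        '__factory__': 'palladium.server.PredictService',
        'mapping': [('x1', 'float'), ('x2', 'float')],
    },
}

Note that the exact file layout the File persister expects (e.g. any accompanying version metadata) should be checked against the persistence docs before relying on this.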

Columns and DataType Not Explicitly Set on line 242 of fit.py

Hello!

I found an AI-Specific Code smell in your project.
The smell is called: Columns and DataType Not Explicitly Set

You can find more information about it in this paper: https://dl.acm.org/doi/abs/10.1145/3522664.3528620.

According to the paper, the smell is described as follows:

Problem: If the columns are not selected explicitly, it is not easy for developers to know what to expect in the downstream data schema. If the datatype is not set explicitly, the program may silently continue to the next step even though the input is unexpected, which may cause errors later. The same applies to other data importing scenarios.
Solution: It is recommended to set the columns and DataType explicitly in data processing.
Impact: Readability

Example:

### Pandas Column Selection
import pandas as pd
df = pd.read_csv('data.csv')
+ df = df[['col1', 'col2', 'col3']]

### Pandas Set DataType
import pandas as pd
- df = pd.read_csv('data.csv')
+ df = pd.read_csv('data.csv', dtype={'col1': 'str', 'col2': 'int', 'col3': 'float'})

You can find the code related to this smell in this link:

palladium/palladium/fit.py

Lines 232 to 252 in 7ff90ac

"want to define a 'scoring' option in the configuration."
)
search = GridSearchCV(model, **search_kwargs)
else:
search = grid_search
with timer(logger.info, "Running grid search"):
search.fit(X, y)
results = pandas.DataFrame(search.cv_results_)
pandas.options.display.max_rows = len(results)
pandas.options.display.max_columns = len(results.columns)
if 'rank_test_score' in results:
results = results.sort_values('rank_test_score')
print(results)
if save_results:
results.to_csv(save_results, index=False)
if persist_best:
_persist_model(search, model_persister, activate=True)
return search
.

I also found instances of this smell in other files.

I hope this information is helpful!

config's model entry is not updated after a model is loaded or updated

When accessing get_config()['model'], you might not get the currently active model, because the config is not updated after models are loaded or updated. If a prediction server is started, you get the model object instantiated from the model section of the config.

/alive and /predict entry points are still using the correct active model.

The current workaround for the CachedUpdatePersister to get the active model is to use the process_store in palladium.util, as it is updated by the CachedUpdatePersister:

from palladium.util import process_store
process_store.get('model')

Additional persisters

I have written an S3 persister (FileIO and FileLike) based on s3fs and thought about making a PR (if you are interested), but I assume that you probably don't want a direct dependency on packages such as s3fs and boto3. Two questions:

  • Are you interested in a PR at all?
  • If so, how can the S3 persister be integrated without cluttering the dependencies?
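For reference, a rough sketch of the idea, assuming palladium.persistence.FileLike accepts a file-system-like io object whose open method it uses for reading and writing; the constructor details here are assumptions, not the final API:

import s3fs
from palladium.persistence import FileLike

class S3(FileLike):
    def __init__(self, path, **fs_kwargs):
        # s3fs.S3FileSystem provides the `open` method FileLike needs
        # to read and write pickled models on S3.
        super().__init__(path, io=s3fs.S3FileSystem(**fs_kwargs))

Shipping such a class in a contrib-style module (or a separate package) might answer the second question: s3fs and boto3 would only be imported when the persister is actually configured.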

psutil v. 3 compatibility

The /alive call does not work with psutil v3 and upwards, because the get_memory_info method no longer exists (psutil 3.0 removed the deprecated get_ accessors in favor of memory_info and friends). A fix would be welcome.
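A minimal compatibility sketch that works on both sides of the rename (the helper name is hypothetical):

import psutil

def process_memory_info(proc):
    # psutil >= 2.0 provides memory_info(); older versions only have
    # the get_memory_info() name, which was removed in 3.0.
    if hasattr(proc, 'memory_info'):
        return proc.memory_info()
    return proc.get_memory_info()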

File model persister should provide option to create directory

If the path specified in the File model persister section of Palladium's config points to a non-existent directory, persisting the model will fail (after training). It would be good to have an option that creates the directory if it does not exist yet (and to make that the default behavior).
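A minimal sketch of the requested behavior; the helper name is hypothetical, and in practice something like it could run just before the File persister writes the pickle:

import os

def ensure_parent_dir(path):
    # Create the directory the persister will write into, if missing.
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)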

Avoid delayed initialization using other WSGI servers

  1. If Palladium is started with the pld-devserver script, the initialization (initialize_config) runs directly, before the service is exposed. Using gunicorn with Palladium leads to delayed initialization, i.e., initialize_config will not run before the first request has arrived. This can lead to bad response times for the first requests after start-up.

  2. Another observation is that pld-devserver binds the port once the service is up and running, whereas gunicorn (at least in its default setting) binds the port early, and incoming requests are blocked until Palladium's initialization has finished.

To solve the first issue, it would be possible to run initialize_config() directly in the module to harmonize the behavior, but due to some dependencies during the start-up phase I haven't found a very elegant solution yet. (Multiple initialize_config calls would then happen if the pld-fit / pld-test / etc. scripts are run.) A possible workaround is sketched below.
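A hedged sketch of that workaround: a small wsgi.py that forces eager initialization at import time, so that gunicorn wsgi:app serves an already-initialized application. This assumes initialize_config can be imported from palladium.util and that palladium.server exposes the Flask app:

from palladium.util import initialize_config
from palladium.server import app  # re-exported for gunicorn

# Run Palladium's initialization eagerly, before the first request.
initialize_config()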

Model persister ignores AssertionError

When using a model persister (it does not seem to matter whether it is palladium.persistence.CachedUpdatePersister or palladium.persistence.File), starting a service where model loading throws an exception swallows the underlying error and simply reports KeyError: 'model':

ERROR:palladium:Unexpected error
Traceback (most recent call last):
  File "/opt/conda/envs/blotto/lib/python3.6/site-packages/palladium/server.py", line 201, in predict
    model = model_persister.read()
  File "/opt/conda/envs/blotto/lib/python3.6/site-packages/palladium/persistence.py", line 484, in read
    return self.cache[self.key]
  File "/opt/conda/envs/blotto/lib/python3.6/site-packages/palladium/util.py", line 214, in __getitem__
    return super(ProcessStore, self).__getitem__(key)
  File "/opt/conda/envs/blotto/lib/python3.6/collections/__init__.py", line 993, in __getitem__
    raise KeyError(key)
KeyError: 'model'

However, if I run get_config manually, I get the underlying error:

>>> from palladium.util import get_config
>>> c = get_config()
DEBUG:gensim.models.doc2vec:Fast version of gensim.models.doc2vec is being used
INFO:summa.preprocessing.cleaner:'pattern' package not found; tag filters are not available for English
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/blotto/lib/python3.6/site-packages/palladium/util.py", line 91, in get_config
    _initialize_config(_config)
  File "/opt/conda/envs/blotto/lib/python3.6/site-packages/palladium/util.py", line 136, in _initialize_config
    component.initialize_component(config)
  File "/opt/conda/envs/blotto/lib/python3.6/site-packages/palladium/persistence.py", line 473, in initialize_component
    self.update_cache()
  File "/opt/conda/envs/blotto/lib/python3.6/site-packages/palladium/util.py", line 365, in wrapper
    return self.wrapped(*args, **kwargs)
  File "/opt/conda/envs/blotto/lib/python3.6/site-packages/palladium/persistence.py", line 490, in update_cache
    model = self.impl.read(*args, **kwargs)
  File "/opt/conda/envs/blotto/lib/python3.6/site-packages/palladium/persistence.py", line 94, in read
    return pickle.load(f)
  File "/opt/conda/envs/blotto/lib/python3.6/site-packages/torch/cuda/__init__.py", line 492, in _lazy_new
    _lazy_init()
  File "/opt/conda/envs/blotto/lib/python3.6/site-packages/torch/cuda/__init__.py", line 160, in _lazy_init
    _check_driver()
  File "/opt/conda/envs/blotto/lib/python3.6/site-packages/torch/cuda/__init__.py", line 74, in _check_driver
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
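A sketch of the kind of fix this suggests: log and re-raise inside CachedUpdatePersister.update_cache, so that the real cause (here the CUDA AssertionError) surfaces instead of a later KeyError: 'model'. Attribute names follow the traceback above; the details are assumptions:

import logging

logger = logging.getLogger('palladium')

def update_cache(self, *args, **kwargs):
    try:
        model = self.impl.read(*args, **kwargs)
    except Exception:
        # Surface the underlying loading error instead of leaving the
        # cache empty and failing later with KeyError: 'model'.
        logger.exception("Loading the model failed")
        raise
    self.cache[self.key] = model
    return model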
