ottogroup / palladium
Framework for setting up predictive analytics services
License: Apache License 2.0
At the moment, only the model can be persisted and loaded. However, there are scenarios that necessitate saving and loading additional data.
For example, assume we have a regression problem and want to normalize the targets to a certain range during training, but when calling the predict service, predictions should be mapped back to the original range. Touching the targets is not part of an sklearn pipeline, so we may do it during data loading. However, when we start the prediction service, we need access to the mapping. Currently, we would have to load the data again to regenerate the mapping, or try to save the mapping as an attribute of the model.
Ideally, we would be able to save and load the mapping using palladium tools. The solution should not be too specific to the example above, but should offer a general way to persist additional artifacts.
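To make the regression example concrete, here is a minimal sketch of persisting an auxiliary artifact (the target-range mapping) next to the model. The helper names `save_artifact`/`load_artifact` are illustrative assumptions, not part of palladium's API:

```python
import json

def save_artifact(mapping, path):
    """Persist a target-range mapping so the predict service can invert it."""
    with open(path, 'w') as f:
        json.dump(mapping, f)

def load_artifact(path):
    with open(path) as f:
        return json.load(f)

# Training side: targets were scaled from [lo, hi] down to [0, 1].
save_artifact({'lo': 10.0, 'hi': 90.0}, 'target_mapping.json')

# Prediction side: load the mapping and invert the scaling on model output.
m = load_artifact('target_mapping.json')

def to_original_range(y_scaled):
    return m['lo'] + y_scaled * (m['hi'] - m['lo'])
```

A general solution in palladium would presumably wrap this pattern behind the model persister, so the artifact is versioned and loaded together with the model.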
Right now, only the /alive and a single /predict entry point exist, and there is no nice way to specify an individual entry point, let alone multiple entry points. We've had some cases where we wanted more than one /predict variant in one Palladium instance and implemented it by cloning the predict function and assigning a new entry point.
The solution should allow for specifying multiple predict service instances (and entry point names).
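As a sketch of what such a configuration could look like, here is a hypothetical config fragment. The `predict_services` key and the per-entry `entry_point` option are assumptions about a possible design, not current palladium config options; `mapping` follows the existing PredictService style:

```python
# Hypothetical config sketch: two predict services with distinct entry points.
{
    'predict_services': [
        {
            '__factory__': 'palladium.server.PredictService',
            'mapping': [('length', 'float'), ('width', 'float')],
            'entry_point': '/predict/flowers',  # assumed option
        },
        {
            '__factory__': 'palladium.server.PredictService',
            'mapping': [('text', 'str')],
            'entry_point': '/predict/text',  # assumed option
        },
    ],
}
```

Each entry would register its own Flask route, so one Palladium instance could serve several predict variants without cloning code.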
I already have a skorch/pytorch model pickled into a file. How do I use it with the palladium dev-server, without retraining or prototyping?
Also, does it support handling multiple requests in parallel?
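One plausible approach (a sketch, not an official answer): point the File model persister at the existing pickle so no training run is needed. This assumes the pickle is renamed to match the persister's versioned naming pattern, e.g. `model-1.pickle`:

```python
# Config fragment sketch: serve a pre-existing pickle via the File persister.
# Assumes the file on disk is named to match the '{version}' pattern.
'model_persister': {
    '__factory__': 'palladium.persistence.File',
    'path': 'model-{version}.pickle',
},
```

Regarding parallelism: the dev-server is meant for development; for concurrent requests one would typically run the WSGI app under a multi-worker server such as gunicorn.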
Hello!
I found an AI-Specific Code smell in your project.
The smell is called: Columns and DataType Not Explicitly Set
You can find more information about it in this paper: https://dl.acm.org/doi/abs/10.1145/3522664.3528620.
According to the paper, the smell is described as follows:
| | |
|---|---|
| Problem | If the columns are not selected explicitly, it is not easy for developers to know what to expect in the downstream data schema. If the datatype is not set explicitly, it may silently continue to the next step even though the input is unexpected, which may cause errors later. The same applies to other data importing scenarios. |
| Solution | It is recommended to set the columns and DataType explicitly in data processing. |
| Impact | Readability |
Example:
### Pandas Column Selection
import pandas as pd
df = pd.read_csv('data.csv')
+ df = df[['col1', 'col2', 'col3']]
### Pandas Set DataType
import pandas as pd
- df = pd.read_csv('data.csv')
+ df = pd.read_csv('data.csv', dtype={'col1': 'str', 'col2': 'int', 'col3': 'float'})
You can find the code related to this smell at lines 232 to 252 in commit 7ff90ac.
I also found instances of this smell in other files.
I hope this information is helpful!
When accessing get_config()['model'], you might not get the currently active model, as the config is not updated after loading or updating models. If a prediction server is started, you will get an instantiated model object as specified in the model section of the config.
/alive and /predict entry points are still using the correct active model.
The current workaround for the CachedUpdatePersister to get the active model is to use the process_store in palladium.util, as it is updated by the CachedUpdatePersister:
from palladium.util import process_store
process_store.get('model')
I have written an S3 persister (FileIO and FileLike) based on s3fs and thought about making a PR (if you are interested), but I assume that you probably don't want a direct dependency on packages such as s3fs and boto3. Two questions:
The /alive call does not work with psutil v3 and up because the get_memory_info method no longer exists (it was renamed to memory_info). A fix would be welcome.
I can't find it. Where should I look to find the documentation for pld-export?
If the path specified in the File model persister section of Palladium's config points to a non-existent directory, persisting the model will fail (after training). It would be good to have an option that will create the directory if it does not exist yet (and set it as default behavior).
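The proposed default behavior can be sketched in a few lines; `persist_model` is a hypothetical helper, not palladium's actual persister code:

```python
import os
import pickle
import tempfile

def persist_model(model, path):
    """Create missing parent directories, then pickle the model."""
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)  # no-op if it already exists
    with open(path, 'wb') as f:
        pickle.dump(model, f)

# Usage: the nested directory does not exist yet, but persisting succeeds.
target = os.path.join(tempfile.mkdtemp(), 'models', 'model-1.pickle')
persist_model({'weights': [1, 2, 3]}, target)
```

With `exist_ok=True` the call is idempotent, so making this the default would not break setups where the directory is already present.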
If Palladium is started with the pld-devserver script, the initialization (initialize_config) is run directly before exposing the service. Using gunicorn with Palladium leads to a delayed initialization, i.e., initialize_config will not be run before the first request has arrived. This can lead to bad response times for first requests after start-up.
Another observation is that pld-devserver binds the port when the service is up and running whereas gunicorn (at least in default setting) binds the port early and incoming requests are blocked until Palladium's initialization has finished.
To solve the first issue, it would be possible to run initialize_config() directly in the module to harmonize the behavior, but due to some dependencies during the start-up phase I haven't found a very elegant solution yet. (Multiple initialize_config calls would then happen if the pld-fit / pld-test / etc. scripts are run.)
When using a model persister (it does not seem to matter whether it is palladium.persistence.CachedUpdatePersister or palladium.persistence.File), starting a service where model loading throws an exception swallows the underlying error and simply reports KeyError: 'model':
ERROR:palladium:Unexpected error
Traceback (most recent call last):
File "/opt/conda/envs/blotto/lib/python3.6/site-packages/palladium/server.py", line 201, in predict
model = model_persister.read()
File "/opt/conda/envs/blotto/lib/python3.6/site-packages/palladium/persistence.py", line 484, in read
return self.cache[self.key]
File "/opt/conda/envs/blotto/lib/python3.6/site-packages/palladium/util.py", line 214, in __getitem__
return super(ProcessStore, self).__getitem__(key)
File "/opt/conda/envs/blotto/lib/python3.6/collections/__init__.py", line 993, in __getitem__
raise KeyError(key)
KeyError: 'model'
However, if I run get_config manually, I get the underlying error:
>>> from palladium.util import get_config
>>> c = get_config()
DEBUG:gensim.models.doc2vec:Fast version of gensim.models.doc2vec is being used
INFO:summa.preprocessing.cleaner:'pattern' package not found; tag filters are not available for English
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/conda/envs/blotto/lib/python3.6/site-packages/palladium/util.py", line 91, in get_config
_initialize_config(_config)
File "/opt/conda/envs/blotto/lib/python3.6/site-packages/palladium/util.py", line 136, in _initialize_config
component.initialize_component(config)
File "/opt/conda/envs/blotto/lib/python3.6/site-packages/palladium/persistence.py", line 473, in initialize_component
self.update_cache()
File "/opt/conda/envs/blotto/lib/python3.6/site-packages/palladium/util.py", line 365, in wrapper
return self.wrapped(*args, **kwargs)
File "/opt/conda/envs/blotto/lib/python3.6/site-packages/palladium/persistence.py", line 490, in update_cache
model = self.impl.read(*args, **kwargs)
File "/opt/conda/envs/blotto/lib/python3.6/site-packages/palladium/persistence.py", line 94, in read
return pickle.load(f)
File "/opt/conda/envs/blotto/lib/python3.6/site-packages/torch/cuda/__init__.py", line 492, in _lazy_new
_lazy_init()
File "/opt/conda/envs/blotto/lib/python3.6/site-packages/torch/cuda/__init__.py", line 160, in _lazy_init
_check_driver()
File "/opt/conda/envs/blotto/lib/python3.6/site-packages/torch/cuda/__init__.py", line 74, in _check_driver
raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
As suggested in #22
Is this project still active?
Are there alternatives?
The pandas.rpy module was deprecated and removed in Pandas 0.20. Running tests with Pandas >= 0.20 currently produces an error.
Notes on how to upgrade code to run with newer Pandas (using the rpy2 package) are available in the Pandas documentation.