ottogroup / palladium

Framework for setting up predictive analytics services

License: Apache License 2.0

Shell 1.95% Python 97.37% R 0.37% Dockerfile 0.30%
data-science machine-learning scikit-learn

palladium's People

Contributors

alattner, benjamin-work, dnouri, grzr, hgraubit, m-jain-1, ottogroup-com, ottonemo, sayreblades, scieneers-jw, yversley-ottogroup


palladium's Issues

[Feature request] Add a possibility to persist artifacts besides the model itself

At the moment, only the model can be persisted and loaded. However, there are scenarios that necessitate saving and loading additional data.

E.g., assume we have a regression problem. We want to normalize the targets to a certain range during training, but when the predict service is called, predictions should be mapped back to the original range. Touching the targets is not part of an sklearn pipeline, so we may do it during data loading. However, when we start the prediction service, we need access to the mapping. Currently, we would have to load the data again to generate the mapping, or try to save the mapping as an attribute of the model.

Ideally, we would be able to simply save and load the mapping using Palladium tools. The solution should not be too specific to the example above, but should address persisting additional artifacts in general.
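Until such a tool exists, a minimal sketch of the attribute-based workaround mentioned above might look as follows; the TargetScalingModel wrapper is hypothetical and only illustrates storing the mapping on the model so that it travels with the pickle:

from sklearn.base import BaseEstimator

class TargetScalingModel(BaseEstimator):
    """Wrap an estimator and remember the target range seen during fit."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        # Remember the original target range so that predictions can be
        # mapped back when the prediction service is running.
        self.y_min_, self.y_max_ = y.min(), y.max()
        y_scaled = (y - self.y_min_) / (self.y_max_ - self.y_min_)
        self.estimator.fit(X, y_scaled)
        return self

    def predict(self, X):
        y_scaled = self.estimator.predict(X)
        # Map predictions back to the original target range.
        return y_scaled * (self.y_max_ - self.y_min_) + self.y_min_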

Flexible handling of (multiple) entry points in Palladium's config

Right now there are only the /alive entry point and a single /predict entry point, and there is no clean way to specify an individual entry point, let alone multiple ones. We've had cases where we wanted more than one /predict variant in one Palladium instance and implemented it by cloning the predict function and assigning it a new entry point.

The solution should allow for specifying multiple predict service instances (and entry point names).
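For illustration, a hypothetical config shape for this feature could look like the following; the predict_services key and the per-service entry point names do not exist in Palladium today and are purely a suggestion:

{
    # Hypothetical: map entry point names to predict service instances.
    'predict_services': {
        '/predict/price': {
            '__factory__': 'palladium.server.PredictService',
            'mapping': [('sqm', 'float')],
        },
        '/predict/demand': {
            '__factory__': 'palladium.server.PredictService',
            'mapping': [('week', 'int')],
        },
    },
}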

How do I use Palladium with a pretrained skorch/pytorch model?

I already have a skorch/pytorch model pickled into a file. How do I use it with the palladium dev-server, without having to retrain or prototype?
Also, does it support handling multiple requests in parallel?
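A minimal sketch of a config that serves an existing pickle without retraining, assuming the file is renamed to match the File persister's '{version}' placeholder (e.g. model-1.pickle); the paths and the mapping below are placeholders:

{
    'model_persister': {
        '__factory__': 'palladium.persistence.File',
        # The File persister resolves '{version}'; the pretrained pickle
        # would need to be stored as e.g. /srv/models/model-1.pickle.
        'path': '/srv/models/model-{version}.pickle',
    },
    'predict_service': {
        '__factory__': 'palladium.server.PredictService',
        'mapping': [('x1', 'float'), ('x2', 'float')],
    },
}

Note that the exact file layout the File persister expects (e.g. any accompanying version metadata) should be checked against the persistence docs before relying on this.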

Columns and DataType Not Explicitly Set on line 242 of fit.py

Hello!

I found an AI-Specific Code smell in your project.
The smell is called: Columns and DataType Not Explicitly Set

You can find more information about it in this paper: https://dl.acm.org/doi/abs/10.1145/3522664.3528620.

According to the paper, the smell is described as follows:

Problem: If the columns are not selected explicitly, it is not easy for developers to know what to expect in the downstream data schema. If the datatype is not set explicitly, the program may silently continue to the next step even though the input is unexpected, which may cause errors later. The same applies to other data importing scenarios.
Solution: It is recommended to set the columns and DataType explicitly in data processing.
Impact: Readability

Example:

### Pandas Column Selection
import pandas as pd
df = pd.read_csv('data.csv')
+ df = df[['col1', 'col2', 'col3']]

### Pandas Set DataType
import pandas as pd
- df = pd.read_csv('data.csv')
+ df = pd.read_csv('data.csv', dtype={'col1': 'str', 'col2': 'int', 'col3': 'float'})

You can find the code related to this smell in this link:

palladium/palladium/fit.py

Lines 232 to 252 in 7ff90ac

"want to define a 'scoring' option in the configuration."
)
search = GridSearchCV(model, **search_kwargs)
else:
search = grid_search
with timer(logger.info, "Running grid search"):
search.fit(X, y)
results = pandas.DataFrame(search.cv_results_)
pandas.options.display.max_rows = len(results)
pandas.options.display.max_columns = len(results.columns)
if 'rank_test_score' in results:
results = results.sort_values('rank_test_score')
print(results)
if save_results:
results.to_csv(save_results, index=False)
if persist_best:
_persist_model(search, model_persister, activate=True)
return search
.

I also found instances of this smell in other files.

I hope this information is helpful!

config's model entry is not updated after a model is loaded or updated

When accessing get_config()['model'], you might not get the currently active model, because the config is not updated after models are loaded or updated. If a prediction server is started, you get the model object instantiated from the model section of the config.

/alive and /predict entry points are still using the correct active model.

The current workaround for the CachedUpdatePersister to get the active model is to use the process_store in palladium.util, as it is updated by the CachedUpdatePersister:

from palladium.util import process_store
process_store.get('model')

Additional persisters

I have written an S3 persister (FileIO and FileLike) based on s3fs and thought about making a PR (if you are interested), but I assume that you probably don't want a direct dependency on packages such as s3fs and boto3. Two questions:

  • Are you interested in a PR at all?
  • If so, how can the S3 persister be integrated without cluttering the dependencies?
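For reference, a rough sketch of the idea, assuming palladium.persistence.FileLike accepts a file-system-like io object whose open method it uses for reading and writing; the constructor details here are assumptions, not the final API:

import s3fs
from palladium.persistence import FileLike

class S3(FileLike):
    def __init__(self, path, **fs_kwargs):
        # s3fs.S3FileSystem provides the `open` method FileLike needs
        # to read and write pickled models on S3.
        super().__init__(path, io=s3fs.S3FileSystem(**fs_kwargs))

Shipping such a class in a contrib-style module (or a separate package) might answer the second question: s3fs and boto3 would only be imported when the persister is actually configured.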

psutil v. 3 compatibility

The /alive call does not work with psutil v3 and upwards, because the get_memory_info method no longer exists (psutil 3.0 removed the deprecated get_ accessors in favor of memory_info and friends). A fix would be welcome.
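A minimal compatibility sketch that works on both sides of the rename (the helper name is hypothetical):

import psutil

def process_memory_info(proc):
    # psutil >= 2.0 provides memory_info(); older versions only have
    # the get_memory_info() name, which was removed in 3.0.
    if hasattr(proc, 'memory_info'):
        return proc.memory_info()
    return proc.get_memory_info()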

File model persister should provide option to create directory

If the path specified in the File model persister section of Palladium's config points to a non-existent directory, persisting the model will fail (after training). It would be good to have an option that creates the directory if it does not exist yet (and to make that the default behavior).
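A minimal sketch of the requested behavior; the helper name is hypothetical, and in practice something like it could run just before the File persister writes the pickle:

import os

def ensure_parent_dir(path):
    # Create the directory the persister will write into, if missing.
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)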

Avoid delayed initialization using other WSGI servers

  1. If Palladium is started with the pld-devserver script, the initialization (initialize_config) runs directly, before the service is exposed. Using gunicorn with Palladium leads to delayed initialization, i.e., initialize_config will not run before the first request has arrived. This can lead to bad response times for the first requests after start-up.

  2. Another observation is that pld-devserver binds the port once the service is up and running, whereas gunicorn (at least in its default setting) binds the port early, and incoming requests are blocked until Palladium's initialization has finished.

To solve the first issue, it would be possible to run initialize_config() directly in the module to harmonize the behavior, but due to some dependencies during the start-up phase I haven't found a very elegant solution yet. (Multiple initialize_config calls would then happen if the pld-fit / pld-test / etc. scripts are run.) A possible workaround is sketched below.
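A hedged sketch of that workaround: a small wsgi.py that forces eager initialization at import time, so that gunicorn wsgi:app serves an already-initialized application. This assumes initialize_config can be imported from palladium.util and that palladium.server exposes the Flask app:

from palladium.util import initialize_config
from palladium.server import app  # re-exported for gunicorn

# Run Palladium's initialization eagerly, before the first request.
initialize_config()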

Model persister ignores AssertionError

When using a model persister (it does not seem to matter whether it is palladium.persistence.CachedUpdatePersister or palladium.persistence.File), starting a service where model loading throws an exception swallows the underlying error and simply reports KeyError: 'model':

ERROR:palladium:Unexpected error
Traceback (most recent call last):
  File "/opt/conda/envs/blotto/lib/python3.6/site-packages/palladium/server.py", line 201, in predict
    model = model_persister.read()
  File "/opt/conda/envs/blotto/lib/python3.6/site-packages/palladium/persistence.py", line 484, in read
    return self.cache[self.key]
  File "/opt/conda/envs/blotto/lib/python3.6/site-packages/palladium/util.py", line 214, in __getitem__
    return super(ProcessStore, self).__getitem__(key)
  File "/opt/conda/envs/blotto/lib/python3.6/collections/__init__.py", line 993, in __getitem__
    raise KeyError(key)
KeyError: 'model'

However, if I run get_config manually, I get the underlying error:

>>> from palladium.util import get_config
>>> c = get_config()
DEBUG:gensim.models.doc2vec:Fast version of gensim.models.doc2vec is being used
INFO:summa.preprocessing.cleaner:'pattern' package not found; tag filters are not available for English
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/blotto/lib/python3.6/site-packages/palladium/util.py", line 91, in get_config
    _initialize_config(_config)
  File "/opt/conda/envs/blotto/lib/python3.6/site-packages/palladium/util.py", line 136, in _initialize_config
    component.initialize_component(config)
  File "/opt/conda/envs/blotto/lib/python3.6/site-packages/palladium/persistence.py", line 473, in initialize_component
    self.update_cache()
  File "/opt/conda/envs/blotto/lib/python3.6/site-packages/palladium/util.py", line 365, in wrapper
    return self.wrapped(*args, **kwargs)
  File "/opt/conda/envs/blotto/lib/python3.6/site-packages/palladium/persistence.py", line 490, in update_cache
    model = self.impl.read(*args, **kwargs)
  File "/opt/conda/envs/blotto/lib/python3.6/site-packages/palladium/persistence.py", line 94, in read
    return pickle.load(f)
  File "/opt/conda/envs/blotto/lib/python3.6/site-packages/torch/cuda/__init__.py", line 492, in _lazy_new
    _lazy_init()
  File "/opt/conda/envs/blotto/lib/python3.6/site-packages/torch/cuda/__init__.py", line 160, in _lazy_init
    _check_driver()
  File "/opt/conda/envs/blotto/lib/python3.6/site-packages/torch/cuda/__init__.py", line 74, in _check_driver
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
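A sketch of the kind of fix this suggests: log and re-raise inside CachedUpdatePersister.update_cache, so that the real cause (here the CUDA AssertionError) surfaces instead of a later KeyError: 'model'. Attribute names follow the traceback above; the details are assumptions:

import logging

logger = logging.getLogger('palladium')

def update_cache(self, *args, **kwargs):
    try:
        model = self.impl.read(*args, **kwargs)
    except Exception:
        # Surface the underlying loading error instead of leaving the
        # cache empty and failing later with KeyError: 'model'.
        logger.exception("Loading the model failed")
        raise
    self.cache[self.key] = model
    return model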
