Comments (5)
Thanks for putting this together, @ablaom.
A few notes and suggestions from my end.
Regarding `logging=nothing` and item "1. Serializing machines", it might be useful to think about the different types of arguments we might provide. Per MLFlowClient's reference, there are three main types we can use as parameters for logging.

`MLFlowClient.MLFlow` is the type used to define an mlflow client; it is usually instantiated as `mlf = MLFlow("http://localhost:5000")`.
Then when we create an experiment and a run, it looks like this:
# Create MLFlow instance
mlf = MLFlow("http://localhost:5000")
# Initiate new experiment
experiment_id = createexperiment(mlf; name="experiment name, default is a uuid")
# Create a run in the new experiment
exprun = createrun(mlf, experiment_id)
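For later reference, parameters, metrics, and file artifacts can then be attached to the run. The calls below follow my reading of MLFlowClient's reference; the values are illustrative:

```julia
# Attach a parameter, a metric, and a file artifact to the run above
# (illustrative values; signatures as I read them from MLFlowClient's docs):
logparam(mlf, exprun, "nfolds", "6")
logmetric(mlf, exprun, "rms", 0.42)
logartifact(mlf, exprun, "machine.jls")  # path of a local file to upload
```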
I'll start with the simplest case described in the original post here:
> Serializing machines: Calling `MLJModelInterface.save(location, mach)` whenever `location` is an mlflow experiment (instead of a path to file).
`location` could be an `MLFlow`, an `MLFlowExperiment`, or an `MLFlowRun`.

The most obvious case is when we provide an `MLFlowRun`. Runs belong to experiments, and experiments belong to an mlflow instance; a single experiment may have zero or more runs.
Thus, we could define (sketched below):

- `MLJModelInterface.save(location::MLFlowRun, mach)` - save the machine as a serialized artifact in an existing run.
- `MLJModelInterface.save(location::MLFlowExperiment, mach)` - create a new run in an existing experiment and fall back to `MLJModelInterface.save(location::MLFlowRun, mach)`.
- `MLJModelInterface.save(location::MLFlow, mach)` - create a new experiment in the provided `location::MLFlow` and fall back to `MLJModelInterface.save(location::MLFlowExperiment, mach)`.
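A minimal sketch of this cascade, using the method names from this thread. One wrinkle: creating runs and experiments needs an `MLFlow` client in hand, which an `MLFlowRun` or `MLFlowExperiment` alone doesn't carry, so I pass it as a keyword here for simplicity (a wrapper type, as in the next comment, is another way to solve this):

```julia
using MLFlowClient
import MLJModelInterface

function MLJModelInterface.save(location::MLFlowRun, mach; mlf::MLFlow)
    path = tempname() * ".jls"
    MLJModelInterface.save(path, mach)   # the ordinary file serialization
    logartifact(mlf, location, path)     # ...attached to the run as an artifact
end

function MLJModelInterface.save(location::MLFlowExperiment, mach; mlf::MLFlow)
    run = createrun(mlf, location.experiment_id)  # new run in this experiment
    MLJModelInterface.save(run, mach; mlf)
end

function MLJModelInterface.save(location::MLFlow, mach)
    experiment_id = createexperiment(location)    # new experiment
    experiment = getexperiment(location, experiment_id)
    MLJModelInterface.save(experiment, mach; mlf=location)
end
```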
We can use similar logic when initiating logging from different places, such as performance evaluation, hyperparameter tuning, and controlled model iteration.
@deyandyankov I bundled an `MLFlow` object inside a general `MLFlowInstance` type that lets us store the most important project configurations: `base_uri`, `experiment_name`, and `artifact_location` (we can expand these; it's just a draft). You can see more about that here. With that, no MLFlowClient code is loaded up front: we need to import the library before the methods that log our info can operate, and if it isn't loaded, it is easy to throw an error asking the user to do so.
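For concreteness, here is a minimal sketch of such a wrapper as I understand it from the description above (field names from the comment; the defaults and exact layout are my guesses, and the actual draft may differ). Because it stores only plain configuration, MLFlowClient need not be loaded until something is actually logged:

```julia
# Plain-configuration wrapper; no MLFlowClient types appear here:
struct MLFlowInstance
    base_uri::String
    experiment_name::String
    artifact_location::Union{String,Nothing}
end

# hypothetical convenience constructor with assumed defaults:
MLFlowInstance(base_uri;
               experiment_name="MLJ experiment",
               artifact_location=nothing) =
    MLFlowInstance(base_uri, experiment_name, artifact_location)
```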
Proposed behaviour
When should MLJ actions trigger mlflow logging?
It should be possible to request logging for these actions:
- Serializing machines: calling `MLJModelInterface.save(location, mach)` whenever `location` is an mlflow experiment (instead of a path to a file).
- Performance evaluation: calling `evaluate(mach, ...)` or `evaluate(model, data..., ...)` for any `mach`/`model` (including composite models, such as pipelines).
- Hyperparameter tuning: calling `MLJModelInterface.fit(TunedModel(model, ...), ...)` for any `model` (and hence calling `fit!` on an associated machine).
- Controlled model iteration: calling `MLJModelInterface.fit(IteratedModel(model, ...), ...)` for any `model` (and hence calling `fit!` on an associated machine).
Moreover, it should be possible to arrange automatic logging, i.e., without explicitly requesting logging for each such action.
What should be logged?
1. Serializing machines
- the file ordinarily created by `save(file, mach)` should instead be saved as an mlflow artifact
- additionally, all hyperparameters (i.e., a suitably unpacked representation of `model`)
2. Performance evaluation
Compulsory
- all hyperparameters (i.e., a suitably unpacked representation of `model`)
- the names of the `measures` (aka metrics) applied
- each corresponding aggregate `measurement`
And, if possible:
- the resampling strategy used (e.g., `CV`) and, if possible, its parameters (e.g., `nfolds`)
- the value of `repeats` (to signal the possibility that this is a Monte Carlo variation of resampling)
Optional (included by default)
- the explicit row indices for each train/test fold pair
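To make the above concrete, here is a hedged sketch of mapping these items onto MLFlowClient calls, assuming a local tracking server; `measure` and `measurement` are fields of the `PerformanceEvaluation` object that `evaluate` returns, and the model/data are toys:

```julia
using MLJ, MLFlowClient

mlf = MLFlow("http://localhost:5000")
run = createrun(mlf, createexperiment(mlf; name="eval demo"))

X, y = make_regression(100, 3)                       # toy data
model = (@load RidgeRegressor pkg=MLJLinearModels)()
e = evaluate(model, X, y; resampling=CV(nfolds=6), measure=[rms, mae])

# metrics: one aggregate measurement per measure applied
for (measure, value) in zip(e.measure, e.measurement)
    logmetric(mlf, run, string(measure), value)
end

# resampling details logged as params (illustrative):
logparam(mlf, run, "resampling", "CV(nfolds=6)")
logparam(mlf, run, "repeats", "1")
```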
3. Hyperparameter tuning
Compulsory
For the optimal model:
- The same compulsory items as in 2.
Optional (included by default)
For each model in the search (each hyperparameter set):

- the same compulsory items as in 1, although it might suffice to log only those hyperparameters that change during training
4. Controlled model iteration
Compulsory
For the final trained model (different from the last "evaluated" model, if `retrain=true`; see here):

- The same compulsory items as in 2, plus a final training error, if available (not all iterative MLJ models report a training loss)
Optional (included by default)
For the partially trained model at every "break point" in the iteration:
- The same compulsory items as in 2, plus a final training error, if available
- Serialization of the corresponding "training machine" (see docs), as an artifact
How should logging be structured?
I'm less clear about details here, but here are some comments:
- In tuning, each model evaluated should be a separate run within the same experiment as the optimal model's run.
- Iteration would be similarly structured.
- Since a model (hyperparameter set) can be nested (e.g., pipelines and wrappers), I suggest that a flattened version of the full tree of parameters be computed for purposes of logging (sketched below), with suggestive composite names created for the nested hyperparameters. Possibly, we may want to additionally log `model` as a Julia-serialized artifact??
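A minimal sketch of one way to do the flattening, testing for nested models with `MLJModelInterface.Model` (the hints linked in the implementation notes below may suggest a better approach):

```julia
import MLJModelInterface as MMI

# Recursively flatten a (possibly composite) model's hyperparameters into
# pairs with dot-separated composite names, e.g. "standardizer.features":
function flatten_params(model; prefix="")
    out = Pair{String,Any}[]
    for name in propertynames(model)
        value = getproperty(model, name)
        key = isempty(prefix) ? string(name) : string(prefix, ".", name)
        if value isa MMI.Model
            append!(out, flatten_params(value; prefix=key))  # recurse into nested model
        else
            push!(out, key => value)                         # leaf hyperparameter
        end
    end
    return out
end
```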
User interface points
Some suggestions:
How does the user request logging?
In serialization, one just replaces `location` in `MLJBase.save(location, mach)` with the (wrapped?) mlflow experiment.

In performance evaluation, we add a new kwarg `logger=nothing` to `evaluate`/`evaluate!`, which the user can set to a (wrapped?) mlflow experiment.

Cases 3 and 4 are similar, but `logger=nothing` becomes a new field of the wrapper (the `TunedModel` or `IteratedModel` structs).
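A usage sketch of these suggestions (the `logger` kwarg/field and the `MLFlowInstance` wrapper are proposals from this thread, not existing API; data and model are toys):

```julia
using MLJ

logger = MLFlowInstance("http://localhost:5000")     # hypothetical wrapper

X, y = make_regression(100, 3)                       # toy data
model = (@load RidgeRegressor pkg=MLJLinearModels)()

# performance evaluation with logging (proposed kwarg):
evaluate(model, X, y; resampling=CV(nfolds=6), measure=rms, logger=logger)

# tuning with logging (proposed field):
r = range(model, :lambda, lower=0.01, upper=10.0)
tuned = TunedModel(model=model, range=r, measure=rms, logger=logger)
mach = machine(tuned, X, y) |> fit!
```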
How does the user request automatic logging?
Add a global variable `DEFAULT_LOGGER`, accessed/set by the user with new methods `logger()`/`logger(default_logger)`, initialized to `nothing` in `__init__`, and change the above defaults from `logger=nothing` to `logger=DEFAULT_LOGGER`.
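A sketch of that proposal, with a `Ref` standing in for the global so it can be mutated after `__init__` (names as above; not final API):

```julia
const DEFAULT_LOGGER = Ref{Any}(nothing)   # set to nothing in __init__ in practice

logger() = DEFAULT_LOGGER[]                            # inspect the default logger
logger(new_logger) = (DEFAULT_LOGGER[] = new_logger)   # set it

# downstream methods would then default to it, e.g.
# evaluate(model, data...; logger=logger(), ...)
```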
How does the user suppress optional logging?
We could either add extra kwargs/fields to control the level of verbosity or, if we are wrapping experiments anyway, include the verbosity level in the experiment wrapper. I'm leaning towards the latter (or just making everything compulsory).
Some miscellaneous thoughts on implementation
- A proof of concept already exists for performance evaluation. This shows how to add the new functionality using an extension module, which also forces us to keep the extension as disentangled from current functionality as possible, for better maintenance.
- When a `TunedModel` is `fit`, it "essentially" calls `evaluate!` on each model in the tuning range, so we can get some functionality in that case by simply passing the `logger` parameter on. What actually happens is that `fit` wraps the model as `Resampler(model, ...)`, which has fields for each kwarg of `evaluate`; this resampler gets wrapped as a machine, trained, and then a special `evaluate` method is called on this machine to get the actual evaluation object. So we also need to add `logger` to the `Resampler` struct (which is not public).
- Some hints about how to flatten models appear here and here.
- In `IteratedModel` we already have the `Save` control. Currently the default filename is "machine.jls", but if `!isnothing(logger)` we could instead pass `logger` as the default. Then, we change the default for `controls` to include `Save()` if `!isnothing(logger)`. I imagine something similar could be worked out for `WithEvaluationDo` and `WithTrainingLossesDo` to get the other information we want logged (a sketch follows).
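A sketch of that last idea (`Step`, `Patience`, and `Save` are existing MLJIteration controls; `Save(logger)` and the conditional default are the proposal here, not current API, and `atom` stands for some iterative model):

```julia
using MLJ   # re-exports the MLJIteration controls

# Proposed: derive the default controls from the logger.
function default_controls(logger)
    controls = Any[Step(1), Patience(5)]
    isnothing(logger) || push!(controls, Save(logger))  # proposed Save(logger)
    return controls
end

iterated = IteratedModel(model=atom,
                         resampling=Holdout(),
                         measure=rms,
                         controls=default_controls(logger()))
```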
cc @pebeto @deyandyankov @tlienart @darenasc