juliaai / mlj.jl

A Julia machine learning framework

Home Page: https://juliaai.github.io/MLJ.jl/

License: Other

Julia 100.00%

machine-learning julia pipelines tuning data-science tuning-parameters predictive-modeling classification regression statistics


mlj.jl's Issues

Add resampling by cross-validation

In [src/resampling.jl](https://github.com/alan-turing-institute/MLJ.jl/blob/master/src/resampling.jl) I have defined an object for the CV resampling strategy:

mutable struct CV <: ResamplingStrategy
    n_folds::Int
    is_parallel::Bool
end
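# keyword constructor; the default chosen for is_parallel is an assumption
CV(; n_folds=6, is_parallel=false) = CV(n_folds, is_parallel)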

But we still need corresponding fit and evaluate methods for the Resampler{CV} model, in analogy with the Resampler{Holdout} methods. And a se method as well. (The fitresult of the fit method will be the vector of cv scores; evaluate will return the mean and se the standard deviation / sqrt(n_folds).)
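To make the proposed summary statistics concrete, here is a minimal sketch. It assumes the fitresult is just a vector with one score per fold; the helper names `cv_mean` and `cv_se` are hypothetical and not part of the existing code.

```julia
using Statistics  # stdlib: mean, std

# Hypothetical helpers. `scores` holds one score per fold, as the proposed
# `fit` method for Resampler{CV} would store in its fitresult.
cv_mean(scores::AbstractVector{<:Real}) = mean(scores)

# Standard error as described above: standard deviation / sqrt(n_folds).
cv_se(scores::AbstractVector{<:Real}) = std(scores) / sqrt(length(scores))

scores = [0.9, 1.1, 1.0, 1.2, 0.8, 1.0]  # made-up scores for 6 folds
println("mean = ", cv_mean(scores), ", se = ", cv_se(scores))
```

In the real implementation these would presumably become the `evaluate` and `se` methods for `Resampler{CV}`.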

Error in installing package as given in readme

(v0.7) pkg> add https://github.com/alan-turing-institute/MLJ.jl/tree/master doesn't seem to work, I get:

ERROR: failed to clone from https://github.com/alan-turing-institute/MLJ.jl/tree/master, error: GitError(Code:ERROR, Class:Net, unexpected HTTP status code: 404)

Boosting Packages

During the last call, various libraries were mentioned for boosting algorithms.

The three options we discussed (and, from a brief search, the most commonly used) are:

CatBoost, which currently has no Julia interface, although someone mentioned an interest on his (developers?) side to port it.
XGBoost, which currently has a Julia interface, with a version working for v0.7 and v1.0, alongside tests and a build.jl file to automatically build it if needed when using it. Although this version is not yet "tagged", with the new package manager this is less of an issue (and I can always get in touch with them to speed this process up if necessary).
LightGBM, which has an unmaintained Julia interface, not currently working on v0.7, with no tests present. It also requires the separate building of the dependency, although with Requires / writing our own build file this can be somewhat managed.

From an ease of use perspective it seems to be wisest to integrate/copy over XGBoost, although from a performance/feature point of view a different package might be more optimal if you have some suggestions.

Also some issues are open regarding packaging/upgrading such as #4, #7 and from what I can gather, these have now been fixed? (module registering the package itself?)

FeatureSelector not data agnostic

Randomise partition by default

I was a bit surprised to see that the partition function does not randomise but takes the first say 70%. This can be problematic when (for instance) the dataset is ordered by response (e.g. the crabs dataset).

A possible way would be to just add a keyword rng which can take a (possibly seeded) RNG and shuffle on that?

using Random

# Split `rows` into consecutive chunks according to `fractions`; the final
# chunk gets whatever remains. Pass `rng=nothing` to keep the original order.
function partition(rows::AbstractVector{Int}, fractions...;
                   rng::Union{Nothing, AbstractRNG}=Random.default_rng())
    rows = collect(rows)
    rng === nothing || shuffle!(rng, rows)
    sum(fractions) >= 1 &&
        throw(DomainError(fractions, "fractions must sum to less than one"))
    rowss = Vector{Int}[]
    n_patterns = length(rows)
    start = 1
    for p in fractions
        n = round(Int, p*n_patterns)
        if n == 0
            @warn "A split has only one element"
            n = 1
        end
        push!(rowss, rows[start:(start + n - 1)])
        start += n
    end
    if start > n_patterns
        @warn "Last vector in the split has only one element."
        start = n_patterns
    end
    push!(rowss, rows[start:n_patterns])
    return tuple(rowss...)
end

happy to open a PR if deemed worthy.

PS: I guess an alternative is also to feed a randperm to partition.

Nested parameter definitions

Nested parameter sets are currently not supported. We could borrow from JuMP. For comparison, this is how conditional parameters are declared in R (mlr):

ps = makeParamSet(
  makeDiscreteParam("kernel", values=c("polydot", "rbfdot")),
  makeNumericParam("C", lower=-15, upper=15, trafo=function(x) 2^x),
  makeNumericParam("sigma", lower=-15, upper=15, trafo=function(x) 2^x,
    requires = quote(kernel == "rbfdot")),
  makeIntegerParam("degree", lower = 1, upper = 5,
    requires = quote(kernel == "polydot")))

prediction API for probabilistic classifiers

I think we should not follow the common prediction interface for probabilistic classification which returns an array of probabilities. The reason is that this does not generalize well to multi-label classification, or regression. In fact, it does not generalize at all without various hacks and "secret conventions" that make these cases quite ugly in current interface designs such as mlr or sklearn.

I might be biased, but I'd prefer that probabilistic learners return a distribution interface such as in
https://github.com/alan-turing-institute/skpro
This avoids having to think about how exactly to represent distributions in a return object and proliferating a large set of arbitrary design conventions depending on the metadata/task.
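As a rough illustration of the idea, here is a sketch in which `predict` returns distribution objects rather than an array of probabilities. All names here (`NominalDist`, `pdf`, `mode`, `fit_frequencies`) are hypothetical and only stand in for whatever distribution interface we settle on.

```julia
# Hypothetical stand-in for a distribution over class labels.
struct NominalDist{L}
    probs::Dict{L,Float64}
end

# Probability of a single label (zero for unseen labels).
pdf(d::NominalDist, label) = get(d.probs, label, 0.0)

# Most probable label.
function mode(d::NominalDist)
    best_label, best_p = first(d.probs)
    for (label, p) in d.probs
        if p > best_p
            best_label, best_p = label, p
        end
    end
    return best_label
end

# A toy "probabilistic classifier": it ignores the inputs and predicts the
# class frequencies seen in training.
function fit_frequencies(y::AbstractVector{L}) where L
    counts = Dict{L,Float64}()
    for label in y
        counts[label] = get(counts, label, 0.0) + 1.0
    end
    n = length(y)
    return NominalDist(Dict{L,Float64}(k => v / n for (k, v) in counts))
end

# One distribution per row to be predicted.
predict(d::NominalDist, n_rows::Int) = fill(d, n_rows)

d = fit_frequencies([:a, :b, :a, :a])
yhat = predict(d, 3)
println(pdf(yhat[1], :a), " ", mode(yhat[1]))
```

The point is that downstream code only ever calls `pdf` or `mode` on the returned objects, so the same calling convention could carry over to regression or multi-label targets.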

Implement MLJ interface for linear models

Would someone like to implement the new MLJ interface for linear models for which julia code already exists, including:

GLM.jl for many of these

Lasso.jl ~~which needs upgrading from 0.6.~~

MultivariateStats

GLM

basic interface for some of the GLM regression models JuliaAI/MLJModels.jl#27

Lasso.jl

lasso regression
fused lasso
trend filter
gamma lasso

Multivariate stats

OLS
Ridge

Relevant:

https://discourse.julialang.org/t/how-to-fit-a-glm-to-all-unnamed-features-of-arbitrary-design-matrix/20490

Integrating online and active learning models

Integrating OnlineStats (its online learning algorithms) and giving it an easy to use hyperparameter tuning context makes Julia even more useful for quick ML on real big data.

Add task interface

Goals and collaboration on tasks

In order to facilitate working with multiple project members we need to first:

Create a design document, that outlines APIs between the various parts of the package.
Create a working prototype (complete with installation and examples)

Then individual tasks could be assigned as:

Integration into the Julia package manager (make it work with Pkg.clone() for now, then think about optional dependencies when requesting model implementation packages and adding them on the fly).
Extension API and guidelines (incorporating new libraries as sub-models)
Model composition / tuning implementation

Add tools to estimate resource requirements

Agnostic data container proposal

How about overloading the getindex and setindex methods for each container we want to support to give the different containers common functionality? We can do this without interfering with the existing index methods (or, worse, wrapping the containers in a common struct) as follows:

# types for dispatch:
struct Rows end
struct Cols end
struct Names end

Base.getindex(df::AbstractDataFrame, ::Type{Rows}, r) = df[r,:]
Base.getindex(df::AbstractDataFrame, ::Type{Cols}, c) = df[:, c]  # df[c] in older DataFrames versions
Base.getindex(df::AbstractDataFrame, ::Type{Names}) = names(df)

Base.getindex(df::JuliaDB.Table, ::Type{Rows}, r) = df[r]
Base.getindex(df::JuliaDB.Table, ::Type{Cols}, c) = select(df, c)
Base.getindex(df::JuliaDB.Table, ::Type{Names}) = getfields(typeof(df.columns.columns))

Base.getindex(A::AbstractMatrix, ::Type{Rows}, r) = A[r,:]
Base.getindex(A::AbstractMatrix, ::Type{Cols}, c) = A[:,c]
Base.getindex(A::AbstractMatrix, ::Type{Names}) = 1:size(A, 2)

Base.getindex(v::AbstractVector, ::Type{Rows}, r) = v[r]

Then, for example, df[Rows, 3:7] returns rows 3-7 of df whether df is a DataFrame, a JuliaDB Table, a Matrix or a vector.
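Here is a self-contained version of the same idea restricted to Base containers (no DataFrames or JuliaDB), just to show that generic code can then be written once. The function `first_rows` is a hypothetical example of such generic code.

```julia
# Types used purely for dispatch, as proposed above.
struct Rows end
struct Cols end

Base.getindex(A::AbstractMatrix, ::Type{Rows}, r) = A[r, :]
Base.getindex(A::AbstractMatrix, ::Type{Cols}, c) = A[:, c]
Base.getindex(v::AbstractVector, ::Type{Rows}, r) = v[r]

# Generic code that works for any container supporting X[Rows, r]:
first_rows(X, n) = X[Rows, 1:n]

A = [1 2; 3 4; 5 6]
v = [10, 20, 30]
println(first_rows(A, 2))  # first two rows of the matrix
println(first_rows(v, 2))  # first two entries of the vector
```

Because `Rows` and `Cols` are our own types, these methods do not collide with the existing indexing methods.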

Add tuning by stochastic search

We have a Grid tuning strategy but should add a stochastic tuning strategy Stochastic <: TuningStrategy with a corresponding fit method for TunedModel{Stochastic, <:Model}. The implementer should acquaint themselves with the nested parameter API (see [src/parameters.jl] and [test/parameters.jl]). To this end, I suggest first giving the iterator(::NumericRange, resolution) and iterator(::NominalRange) methods stochastic versions, perhaps by adding a keyword argument stochastic=true.
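A sketch of what the stochastic versions of `iterator` might look like. The `NumericRange` and `NominalRange` structs below are simplified, hypothetical stand-ins (the real ones in src/parameters.jl have more fields, such as a scale), and the `rng` keyword is my own addition for reproducibility.

```julia
using Random

# Simplified, hypothetical stand-ins for the range types named above.
struct NumericRange{T<:Real}
    lower::T
    upper::T
end

struct NominalRange{T}
    values::Vector{T}
end

# Deterministic: `resolution` evenly spaced values.
# Stochastic: `resolution` values drawn uniformly from [lower, upper].
function iterator(r::NumericRange{<:AbstractFloat}, resolution::Int;
                  stochastic::Bool=false,
                  rng::AbstractRNG=Random.default_rng())
    if stochastic
        return r.lower .+ (r.upper - r.lower) .* rand(rng, resolution)
    end
    return collect(range(r.lower, stop=r.upper, length=resolution))
end

# Deterministic: the values in their given order.
# Stochastic: the same values in random order.
function iterator(r::NominalRange; stochastic::Bool=false,
                  rng::AbstractRNG=Random.default_rng())
    return stochastic ? shuffle(rng, r.values) : copy(r.values)
end

r = NumericRange(0.0, 1.0)
grid = iterator(r, 3)
draws = iterator(r, 5; stochastic=true, rng=MersenneTwister(42))
kernels = iterator(NominalRange([:a, :b, :c]); stochastic=true, rng=MersenneTwister(1))
```

Uniform sampling ignores any scale (e.g. log) attached to the range; a real implementation would presumably sample on the transformed scale.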

Add tuning by gradient descent using auto-differentiation

Flux provides a nice AD interface plus SGD optimisers, and this interface is being actively developed.

Literature discussion

https://github.com/EpistasisLab/tpot
MLR

use MLMetrics to remove redundant code

Looks like there's a plan to use MLMetrics for basic utilities (as given in the poster and discussed on the call). Essentially, most of the code in https://github.com/alan-turing-institute/MLJ.jl/blob/master/src/metrics.jl is a re-implementation of MLMetrics functions. It would be good to maintain uniformity and use the already implemented and tested functions.
@ablaom Need your go-ahead on this before making a PR.

Implement NaiveBayes.jl

NaiveBayes.jl

Recent proposal for design of package interface

Please provide any new feedback on the proposed glue-code specification
below. @fkiraly has posted some comments here. It would be helpful also
to have reactions to the two bold items below ("Immutable models" and
"Defaults for hyperparameter ranges").

I will probably move the “update” instructions for the fit2 method to
model hyperparameters, leaving keyword arguments for package-specific
features (not so many use cases). It will be simplified, made into an
argument-mutating function without data as arguments. (If data really
needs to be revisited, a reference to it can be passed via cache.) The
document will explain use cases for this better.

I will require all Model field types to be concrete.

Immutable models. To improve performance, @tlienart has
recommended making models immutable. Mutable models are more
convenient because they avoid the need to implement a copy function,
and you can make a function (eg, loss) a hyperparameter (because you
don't need to copy it). The first annoyance can be dealt with (mostly)
with a macro. To deal with the second you replace a function with a
concrete type ("reference") and use type dispatch within fit to get
the function you actually want. Or something like that. In particular,
you need to know ahead of time what functions you might want to
implement. For unity, we might want to prescribe this part of the
abstraction (for common loss functions, optimisers, metrics, etc)
ourselves (or borrow from an existing library).
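To illustrate the "reference type plus dispatch" workaround, here is a small sketch. The loss types, `ConstantRegressor` and the two-argument `fit` are all hypothetical and chosen only to keep the example short.

```julia
using Statistics  # stdlib: mean, median

# Concrete "reference" types standing in for loss functions.
abstract type Loss end
struct L2Loss <: Loss end
struct L1Loss <: Loss end

# The actual functions are recovered by dispatch.
evaluate(::L2Loss, yhat, y) = sum(abs2, yhat .- y) / length(y)
evaluate(::L1Loss, yhat, y) = sum(abs, yhat .- y) / length(y)

# An immutable model whose only field has a concrete type.
struct ConstantRegressor{L<:Loss}
    loss::L
end

# `fit` dispatches on the loss type: the mean minimises the L2 loss and
# the median minimises the L1 loss.
fit(model::ConstantRegressor{L2Loss}, y) = mean(y)
fit(model::ConstantRegressor{L1Loss}, y) = median(y)

y = [1.0, 2.0, 6.0]
println(fit(ConstantRegressor(L2Loss()), y))
println(fit(ConstantRegressor(L1Loss()), y))
```

The cost is visible here too: every loss we might want has to be declared as a type ahead of time.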

When I wrote my flux interface for Koala I found it very
convenient to use a function as a hyperparameter to generate the
desired architecture, essentially because a "model" in flux is a
function. (I suppose one could (should?) encode the architecture
a la Onnx or similar).

My vote is to keep Models mutable to make it more convenient for
package interfaces writers and because I'm guessing the performance
drawbacks are small. However, others may have a more informed opinion
than I do. For what it is worth, Scikitlearn.jl has mutable models.

What do others think about making models immutable?

Defaults for hyperparameter ranges. Is there a desire for
interfaces to prescribe a range (and scale type) for
hyperparameters, in addition to default values? (To address one
of @fkiraly's comments, default values and types of parameters
are already exposed to MLJ through the package interface's model definition.)

Add tuning by genetic algorithms

We have a Grid tuning strategy but should add genetic algorithm style tuning Genetic <: TuningStrategy with corresponding fit, best and predict methods for TunedModel{Genetic,<:Model}. See the related issue #37.

Conditional package loading like

https://github.com/JuliaIO/FileIO.jl/blob/8656519a4eb45915c1eff21cac17d14054244c5f/src/loadsave.jl#L1-L50

Decoder rounding

While looking at the GaussianProcesses case, I noticed that their method spews out floats. So for instance your training labels may be 1,1,2,1,... and the return would be 0.99, 1.11, ....

Since the inverse transform expects the same type as the input type, there's an extra step needed which I coded as:

nlevels = length(decoder.pool.levels)
pred_rc = clamp.(round.(Int, pred), 1, nlevels)

But this is a bit of a hack and it seems to me this should be addressed within the inverse_transform maybe? Or maybe I missed something that was already present.

Add facility to quickly define a linear transformers-predictor pipeline

I think a macro is the easiest way to do this, given the existing learning networks API. Syntax would look something like:

composite_model =  @pipeline transformer1 transformer2 predictor

The things on the right are models and the result composite_model is just another model whose hyperparameters are called "transformer1", "transformer2", "predictor" and the (mutable) values of these are set to transformer1, transformer2, predictor. Mutating these would mutate composite_model.
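A toy sketch of how such a macro could capture the variable names as hyperparameter names. It does not touch the learning networks API at all; `SimplePipeline`, `Scaler` and `KNN` are hypothetical placeholder types.

```julia
# Hypothetical composite type: its "hyperparameters" are the component
# models, stored under the names of the variables passed to the macro.
struct SimplePipeline{T<:NamedTuple}
    components::T
end

# `@pipeline a b` builds SimplePipeline((a = a, b = b)).
macro pipeline(models...)
    all(m -> m isa Symbol, models) || error("@pipeline expects variable names")
    names = Tuple(models)      # hyperparameter names, e.g. (:scaler, :knn)
    vals = map(esc, models)    # the models themselves, from the caller's scope
    return :(SimplePipeline(NamedTuple{$names}(($(vals...),))))
end

# Placeholder mutable models.
mutable struct Scaler
    center::Bool
end

mutable struct KNN
    K::Int
end

scaler = Scaler(true)
knn = KNN(7)
composite = @pipeline scaler knn

# The components are stored by reference, so this also changes `knn`.
composite.components.knn.K = 3
println(knn.K)
```

A real version would additionally need to generate the `fit` and `predict` methods that wire the components together.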

Finish XGBoost implementation.

Yiannes has done a great job with the code at src/interfaces/XGBoost.jl but it does not meet the model spec yet. The models may need to be split further according to "objective" function as Regressor/Classifier, Deterministic/Probabilistic, etc. And the classifiers need to be integrated with CategoricalArrays, preserving input levels, etc.

I will try to have a look at this myself soon.

Added: It is natural to break the XGBoost model into three separate models, depending on the value of the original XGBoost parameter objective:

XGBoostRegressor <: Deterministic{Any} - for reg:linear, reg:gamma, reg:tweedie(target_scitype = Continuous). MLJ objective default: objective=:linear.
XGBoostCount <: Deterministic{Any} - for count:poisson (target_scitype=Count). MLJ objective hyperparameter has :poisson as only allowed parameter value.
XGBoostClassifier <: Probabilistic{Any} - for binary:logistic, multi:softprob (target_scitype = Union{Multiclass,FiniteOrderedFactor}). MLJ objective parameter has :automatic as only allowed value.

I don't think we should implement any of the other XGBoost objective options at this time. In particular, note that reg:logistic and multi:softmax are redundant. To get these one can use the probabilistic versions and call predict_mode instead of predict. Maybe the doc string can mention this. (We do not need to implement predict_mode because there is a fall-back in MLJBase.)

Notes:

Please, let's implement and test these one at a time and organise the code the same way (ie don't interweave code for the three models, this makes it harder to review.)
We should drop the num_class hyperparameter altogether, as we are inferring this from nlevels(y). This will avoid a lot of dancing around in clean! (The original XGBoost needs this parameter, because it has no way to know the complete pool of target values.)
Since XGBoostClassifier is probabilistic, it will predict vectors of distributions of type MLJBase.UnivariateNominal. As discussed in the guide, we will need to decode the target using decoder=CategoricalDecoder(y, Int). I suggest bundling the decoder with the fitresult to make it available to predict, and to reconstruct the labels (in the correct order) using inverse_transform(decoder, eachindex(levels(decoder))).
To reduce code redundancy, we may want to define a macro for the model struct declarations and keyword constructors. This can be done later, but with this in mind, we should keep the hyperparameter list the same across the three models, even ones that don't apply. We can use clean! to make relevant warnings.

Break MLJ into MLJ and MLJBase

@MikeInnes has suggested that we extract from MLJ.jl the methods to be extended by external packages and put them into a new package MLJBase.jl. So, if I want my package CoolRandomForests.jl to implement the MLJ.jl interface, I just need to import MLJBase.jl. The higher level abstractions (tasks, trainable models, learning networks, tuning and so forth) stay in MLJ.jl, which imports MLJBase.jl.

This is a fairly ubiquitous design pattern in Julia and I can think of no reason not to do this. Unless objections are raised by the end of this Friday 7th, I will go ahead and implement.
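The pattern can be sketched with three modules in a single file. `MiniBase` and `MiniMLJ` are hypothetical stand-ins for MLJBase.jl and MLJ.jl, and the method signatures are simplified placeholders rather than the actual interface.

```julia
module MiniBase             # stand-in for MLJBase.jl
abstract type Model end
function fit end            # generic functions with no methods yet
function predict end
export Model, fit, predict
end

module CoolRandomForests    # stand-in for an external package
import ..MiniBase           # depends only on the light-weight base package

struct CoolForest <: MiniBase.Model
    n_trees::Int
end

# Placeholder "training": just remember the mean of the target.
MiniBase.fit(model::CoolForest, X, y) = sum(y) / length(y)
MiniBase.predict(model::CoolForest, fitresult, Xnew) =
    fill(fitresult, size(Xnew, 1))
end

module MiniMLJ              # stand-in for MLJ.jl
using ..MiniBase

# Higher-level functionality relies only on the base interface.
function fit_predict(model::Model, X, y, Xnew)
    fitresult = fit(model, X, y)
    return predict(model, fitresult, Xnew)
end
end

X = rand(4, 2)
y = [1.0, 2.0, 3.0, 4.0]
yhat = MiniMLJ.fit_predict(CoolRandomForests.CoolForest(10), X, y, rand(3, 2))
println(yhat)
```

Note that `CoolRandomForests` never mentions `MiniMLJ`, which is the whole point of the split.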

Add benchmarking tools

~~Will build on task interface design~~

Investigate "interpretable machine learning" integration

Eg, model-agnostic Shapley values.

Neural Network port via Flux/Knet

The poster mentions a port to work with neural networks, via Flux and Knet to be working, but I couldn't find any interface for these packages.

Upgrade to Julia 1.0

So far, mlj has been developed with Julia 0.6. With the release of Julia 1.0, we should try to upgrade.

There are no test suites right now, so it'll involve a lot of manual testing to see what's broken.

Please comment your findings on what needs to be changed and other thoughts on the upgrading process :)

Merge with gergos design and implement tuning wrapper

Add DAG scheduling (e.g. Dagger.jl) to training of learning networks

Currently each node and machine in a learning network has a simple linear "tape" to track dependencies on machines in the network. I had in mind to replace these tapes with directed acyclic graphs, which (hopefully) makes scheduling amenable to Dagger.jl or similar.

A thorough understanding of the learning network interface at src/networks.jl will be needed. If someone has experience with scheduling, I could provide guidance, but this is probably not a small project.
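For orientation, here is a sketch of the basic ingredient: deriving a valid training order from a dependency graph instead of a linear tape. The representation (a `Dict` from node name to the names it depends on) is hypothetical and much simpler than what src/networks.jl uses; actual parallel scheduling with Dagger.jl is not shown.

```julia
# Hypothetical representation: each node maps to the nodes it depends on.
# Returns the nodes in an order where dependencies always come first
# (depth-first topological sort).
function training_order(deps::Dict{Symbol,Vector{Symbol}})
    order = Symbol[]
    state = Dict{Symbol,Symbol}()   # node => :visiting or :done
    function visit(node)
        s = get(state, node, :new)
        s == :done && return
        s == :visiting && error("cycle detected at $node")
        state[node] = :visiting
        for d in get(deps, node, Symbol[])
            visit(d)
        end
        state[node] = :done
        push!(order, node)
    end
    for node in sort!(collect(keys(deps)))
        visit(node)
    end
    return order
end

# A network shaped like a standardise-then-KNN pipeline:
deps = Dict(
    :X     => Symbol[],
    :y     => Symbol[],
    :scale => [:X],
    :Xt    => [:scale, :X],
    :knn   => [:Xt, :y],
    :yhat  => [:knn, :Xt],
)
order = training_order(deps)
println(order)
```

Nodes with no path between them (here `:y` and `:scale`, for example) are the ones a scheduler could process in parallel.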

Interface for LowRankModels.jl

(I know there's still #35 waiting to be done but the models for which there already is a ScikitLearn.jl interface are somewhat easier to write interfaces for which, hopefully, will make me confident enough to help out on more complex stuff later).

LowRankModels.jl implements quite a lot of interesting stuff:

PCA (iterative approach so should be seen as "approxPCA" afaik)
QPCA (quadratically regularised PCA)
NNMF (non neg matrix facto)
KMeans (note they explicitly say it's not the most efficient way of doing KMeans so while we may want to offer an interface for it, there may be something better later [1])
RPCA (robust PCA)

[1] how do you envisage competing algorithms for the same task? E.g. if two packages do KMeans and implement MLJBase? Or, in fact, if two packages implement PCA like here and MultivariateStats?

I'm happy to try implement an interface for these.

Learning networks and dynamic data

I have made some encouraging progress on how to design learning
networks and want to report this progress here. My solution is
inspired by Mike Innes' work on Flux. The idea is that you just
want to write down the math, and have the framework wrap this in the
appropriate logic under the hood. See also the post On Machine
Learning and Programming Languages.

I will formulate my solution in terms of "dynamic data". Dynamic data
behaves superficially like regular data (e.g., a data frame) but
tracks its dependencies on other data (static and dynamic), as well as
the training events that were used to define them. You can think of
dynamic data as nodes in a learning network if you want to, but the
average user probably doesn't care.

The dynamic data type and "trainable model" type (different from the
current MLJ one) are interdependent and must be defined in just the
right way to make it all work. I think I have it now. Below is a
preview of the syntax from a working implementation (from a private
repo). I will discuss details elsewhere.

Dynamic data and look-through training

A.k.a. Learning pipelines/networks

Let's get some data (the Boston data set):

julia> using MLJ
julia> X, y = datanow(); # ALL of the data, training, test and validation

julia> # split the rows into training and testing rows:
julia> fold1, fold2 = partition(eachindex(y), 0.7) # 70:30 split
([1, 2, 3, 4, 5, 6, 7, 8, 9, 10  …  345, 346, 347, 348, 349, 350, 351, 352, 353, 354], [355, 356, 357, 358, 359, 360, 361, 362, 363, 364  …  497, 498, 499, 500, 501, 502, 503, 504, 505, 506])

Cross-validation the hard way

julia> # construct a transformer to standardize the inputs, using the
julia> # training fold to prevent data leakage:
julia> scale_ = Standardizer()
julia> scale = prefit(scale_, X)
julia> fit!(scale, fold1)
[ Info: Training TrainableModel @ ...170.
[ Info: Done.

Note here that training is split into two phases: a prefit stage, in
which hyperparameters are wrapped in all of the data, but not told
which part (rows) of the data is for training; and a final training
stage, in which we declare which part of the data we want to use. This
is slightly more complicated than the standard approach but critical
to the dynamic approach described later.

julia> # get the transformed inputs:
julia> Xt = transform(scale, X);

julia> # convert data frame `Xt` to an array:
julia> Xa = array(Xt);

julia> # choose a learner and train it on the same fold:
julia> knn_ = KNNRegressor(K=7) # just a container for hyperparameters
julia> knn = prefit(knn_, Xa, y)
julia> fit!(knn, fold1)
[ Info: Training TrainableModel @ ...838.
[ Info: Done.

julia> # get the predictions on the other fold:
julia> yhat = predict(knn, Xa(fold2));

julia> # compute the error:
julia> er1 = rms(y(fold2), yhat)
7.32782969364479

Then we must repeat all of the above with roles of fold1 and fold2
reversed to get er2 (omitted).

And then average er1 and er2 to get our estimate of the
generalization error.

Cross-validation using dynamic data and look-through training:

We will need two lines of code not used above, but everything else
will be easier and use almost identical syntax:

julia> X = dynamic(X)
julia> y = dynamic(y)

julia> # construct a transformer to standardize the inputs:
julia> scale_ = Standardizer()
julia> scale = prefit(scale_, X) # no need to train!

julia> # get the transformed inputs, as if `scale` were already trained:
julia> Xt = transform(scale, X)

julia> # convert DataFrame Xt to an array:
julia> Xa = array(Xt)

julia> # choose a learner and make it trainable:
julia> knn_ = KNNRegressor(K=7)
julia> knn = prefit(knn_, Xa, y) # no need to train!

julia> # get the predictions, as if `knn` already trained:
julia> yhat = predict(knn, Xa)

julia> # compute the error:
julia> er = rms(y, yhat)

Now er is dynamic, so we can do "look-through" training on any rows we
like and evaluate on any rows we like. Look-through training means the scaling and KNN get refitted automatically:

julia> fit!(er, fold1)
[ Info: Training TrainableModel @ ...940.
[ Info: Done.
[ Info: Training TrainableModel @ ...251.
[ Info: Done.

julia> er1 = er(fold2)
7.32782969364479

julia> fit!(er, fold2)
[ Info: Training TrainableModel @ ...940.
[ Info: Done.
[ Info: Training TrainableModel @ ...251.
[ Info: Done.

julia> er2 = er(fold1)
9.616116727127416

julia> er = (er1 + er2)/2
8.471973210386103

Integrate ScikitLearn.jl models

With a view to adding functionality quickly, it has been proposed that we write MLJ package interfaces for some (most) scikit-learn models, ideally in a semi-automated way.

A first step to be addressed in this issue is to investigate the best way to go about doing this. We should follow a path of least resistance; performance is not an issue.

To proceed, some familiarity with the MLJ [package interface spec](https://github.com/alan-turing-institute/MLJ.jl/blob/master/doc/adding_new_models.md) will be required. This is not totally stable, but changes henceforth should be minor. An important part is the specification of model metadata, which is not explicitly exposed in the scikit-learn API and will probably need to be extracted manually, or at least in a supervised fashion. This is needed to help the user connect a prescribed task with the right models.

Two options I can think of to check out:

wrap the ScikitLearn.jl models.
wrap the original scikit-learn python/Cython models directly but probably imitating the way ScikitLearn.jl does this.

Also, have a look at CombineML.jl, which has interfaces to scikit-learn (.jl?) models to see how they do this.

And start to think about how the metadata might be extracted in whatever way we go with.

Organisation of MLJ interfaces

In response to query about competing algorithms: Competing algorithms should be fine and they can even have the same name, which one handles by only importing the relevant packages:

import BobsTrees
import JanesTrees

model1 = BobsTrees.TreeRegressor()
model2 = JanesTrees.TreeRegressor()

However, this doesn't work with lazy loaded models (as currently implemented) because import DecisionTree (with DecisionTree being lazily loaded) imports all the DecisionTree models into the global namespace without the qualifying package name. I don't know how to fix this unwanted behaviour, except to have something ugly like:

import DecisionTree    # lazily loads interface module _DecisionTree

model = _DecisionTree.TreeRegressor()

I suppose an alternative to lazy loading would be to have a separate repo MLJInterfaces.jl which implements the MLJBase interface for all models that do not natively implement it (currently none do). (This is what IterableTables does for the Query TableTraits interface.) The workflow for using a model in a package that does not natively support the MLJ API is then identical to the normal one, except for one step: first import MLJInterfaces, which will load all the interfaces (and dependencies) of those packages.

Importing package from other paths

Looks like using MLJ doesn't work at the moment (from any other directory); the only way is to change directory to the mlj folder and then import it with include("MLJ.jl"). Wouldn't it be better to make this an independent package? Also, some docs might be good.
@vollmersj

Proposal for metadata

I'm inviting feedback on a suggestion for encoding metadata.

We would like to associate certain metadata with models (most of these being defined in external packages). The main purpose of the metadata is so we can mimic the R task interface, which allows a user to match task specifications (e.g., I want a classifier that handles nominal features) to a list of qualifying models.

I expect a local registry will store the model metadata, with a macro call updating the registry each time a model is defined (which means when the user imports the relevant external package, in the case of lazily loaded interfaces).

We suggest metadata consist of:

list of model properties (see below for a list)
list of supported operations (predict, predict_proba, inverse_transform, etc)
the allowed data types for inputs (and target)

Note that at present, the only subtypes of Model (our abstract type for the hyperparameter containers) are Supervised and Unsupervised; so Regression, Classification and MultiClass are just properties.

In the core code we do something like this:

abstract type Property end   # subtypes are the allowable model properties

""" Models with this property perform regression """
struct Regression <: Property end    
""" Models with this property perform binary classification """
struct Classification <: Property end
""" Models with this property perform binary and multiclass classification """
struct MultiClass <: Property end
""" Models with this property support nominal (categorical) features """
struct Nominal <: Property end
""" Models with this property support features of numeric type (continuous or ordered factor) """
struct Numeric <: Property end
""" Classification models with this property allow weighting of the target classes """
struct Weights <: Property end
""" Models with this property support features with missing values """ 
struct NAs <: Property end

And model declarations look something like this:

mutable struct DecisionTreeClassifier{T} <: Supervised{DecisionTreeClassifierFitResultType{T}} 
    pruning_purity::Float64 
    max_depth::Int
    min_samples_leaf::Int
    min_samples_split::Int
    min_purity_increase::Float64
    n_subfeatures::Float64
    display_depth::Int
    post_prune::Bool
    merge_purity_threshold::Float64
end

# metadata:
# (`<:` so that concrete types such as DecisionTreeClassifier{Int} also match)
properties(::Type{<:DecisionTreeClassifier}) = [MultiClass(), Numeric()]
operations(::Type{<:DecisionTreeClassifier}) = [predict]
type_of_X(::Type{<:DecisionTreeClassifier}) = Array{Float64,2}
type_of_y(::Type{<:DecisionTreeClassifier}) = Vector
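To show how this metadata could drive task matching, here is a sketch with a toy registry. The two model types, `REGISTRY` and `matching` are hypothetical; the registry in particular is just a constant vector here, not the macro-updated store described above.

```julia
abstract type Property end
struct Regression <: Property end
struct MultiClass <: Property end
struct Numeric <: Property end
struct Nominal <: Property end

# Two placeholder models and their metadata.
struct ToyTreeClassifier end
struct ToyRidgeRegressor end

properties(::Type{ToyTreeClassifier}) = [MultiClass(), Numeric()]
properties(::Type{ToyRidgeRegressor}) = [Regression(), Numeric()]

# Toy registry of known model types.
const REGISTRY = [ToyTreeClassifier, ToyRidgeRegressor]

# All registered models having every one of the required properties.
matching(required::Property...) =
    [M for M in REGISTRY if all(p -> p in properties(M), required)]

println(matching(MultiClass(), Numeric()))
println(matching(Numeric()))
println(matching(Nominal()))
```

A task specification such as "a classifier that handles nominal features" then reduces to a call like `matching(MultiClass(), Nominal())`.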

transform for Clustering and PCA should return tables

In general transformers should return a table with the preferred sink type of X, where X is the table being transformed. Currently a matrix is being returned by KMeans, KMedoids and PCA. If Xout is this matrix, we should return instead MLJBase.table(Xout, prototype=X)

Dimension convention and table <--> matrix conversions

AFAIK, we use the n x p convention where n is the number of observations. It seems to me that we should however make the whole machinery able to take transposes (especially after ranting against other packages not offering this by hardcoding Matrix{T}).

A fix would be to replace all occurrences of ::Matrix{Float64} by ::AbstractMatrix{Float64}.
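A quick illustration of the difference, using a made-up column-means function (the names `colmeans_strict` and `colmeans` are hypothetical):

```julia
# Narrow signature: only accepts a dense Matrix{Float64}.
colmeans_strict(X::Matrix{Float64}) = vec(sum(X, dims=1)) ./ size(X, 1)

# Wider signature: also accepts lazy transposes, views, etc.
colmeans(X::AbstractMatrix{Float64}) = vec(sum(X, dims=1)) ./ size(X, 1)

X = [1.0 2.0 3.0; 4.0 5.0 6.0]  # stored as p x n (2 features, 3 observations)
Xt = transpose(X)               # lazy n x p wrapper, no copy

println(colmeans(Xt))
println(applicable(colmeans_strict, Xt))  # no method for the lazy transpose
```

With the wider signature a user holding p x n data can pass `transpose(X)` without materialising a copy.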

Thoughts?

Port homogeneous ensembles from Koala

Stheno.jl for GPs

I think https://github.com/willtebbutt/Stheno.jl might be a more rigorous test case than GaussianProcesses.jl on which to condition MLJ's architecture.

It has a more flexible and involved modeling syntax and a full posterior available for sampling/prediction. It is also composable with Turing.jl and Flux. CC: @willtebbutt

Related efforts in the Julia ecosystem. PP, autoML, formulae, visualization, and others.

Very exciting to learn about this effort! A julia native ML package improving on Sklearn is one of the key missing pieces of the ecosystem.

Here's a list of ideas I'd like to bring to your attention, if you haven't considered them already. Some would be very long term projects, that I hope to help with, if they are even within scope. I can open issues for any that deserve their own.

There's already been a relatively developed (yet stalled) effort for something along these lines in the JuliaML ecosystem. Might want to consider integration or lifting of ideas: https://github.com/JuliaML
Integration with prob programming framework like turing.jl would be really cool. They have non gradient samplers that can work with arbitrary julia code, along with HMC. Would be cool if point parameters and priors can be mixed in a model (or model search as prob program induction) with different sampling/optimization strategies. https://github.com/TuringLang/Turing.jl cc @yebai
Regarding architecture search and "automl", here are some python exemplars: https://github.com/automl/auto-sklearn, https://github.com/jhfjhfj1/autokeras, https://github.com/EpistasisLab/tpot. tpot can optimize over non differentiable pipelines using genetic programming.
Tables.jl, by @quinnj, is an alternative table interface which much of the stats ecosystem is coordinating around and which can also hook into iterable tables, though IIRC there are some issues with missing data interop.
One of those with integration is StatsModels.jl, which has a very powerful formula interface that works with abstract tables. Would be cool (and an improvement over sklearn) to integrate with this.
I'm looking to work on some graph NN stuff, so it would be great to have support for non-Euclidean input data à la https://github.com/rusty1s/pytorch_geometric. This is both bringing NNs to graphs and bringing graphs to NNs as useful "inductive biases": https://arxiv.org/abs/1806.01261
Yellowbrick type Plots.jl or Makie recipes

Package it so as to work with Pkg.clone

Treatment of supervised models predicting an ordered categorical

Currently, target_kind can be :nominal or :numeric, with :numeric including ordered categoricals. These last must be represented as integers by the user but predictions can be continuous (or pdf's with continuous support). Is this satisfactory, or do we want to formally separate out the ordered categorical case?

The new proposal could be:

(i) replace [:nominal, :numeric] options with [:factor, :ordered_factor, :continuous]
(ii) implementers of the interface for an algorithm with :ordered_factor targets can expect the target(s) to be ordered CategoricalVector's for case of finite number of classes and Vector{<:Integer} for the infinite case.
(iii) in infinite class case, predict can output a continuous value (float) if available. In finite case a discrete prediction is required.

Cleaning up the code in the repository

Parameter interface and constraint description

https://github.com/JuliaOpt/MathProgBase.jl

Saving metadata about wrapped but not loaded algorithms

Data about a wrapped learner needs to be present even if the module is not installed

FILE.IO
METADATA.JL

MLR meta data

Add model selection tools

For starters: Add a tool to have models compete, based on paired cv scores?

Integrate flux models

Would be good to have some flux integration

Unsupervised learning interfaces - is transformer too narrow?

Regarding unsupervised models such as PCA, kmeans, etc discussed in #44.

I know these are commonly encapsulated within the transformer formalism, but it would do the methodology behind them an injustice, as feature extraction is only one of the major use cases of unsupervised models. More precisely, there are, as far as I can see, three use cases:

(i) feature extraction. For clusterers, create a column with cluster assignment. For continuous dimension reducers, create multiple continuous columns.

(ii) model structure inference - essentially, inspection of the fitted parameters. E.g., PCA components and loadings. Cluster separation metrics etc. These may be of interest in isolation, or used as an (hyper-parameter) input of other atomic models in a learning pipeline.

(iii) full probabilistic modelling aka density estimation. This behaves as a probabilistic multivariate regressor/classifier on the input variables.

For a start it makes sense to implement only "transformer" functionality, but it is maybe good to keep in mind for implementation that eventually one may like to expose the other outputs via interfaces. E.g., the estimated multivariate density in a fully probabilistic implementation of k-means.