juliaai / mlj.jl
A Julia machine learning framework
Home Page: https://juliaai.github.io/MLJ.jl/
License: Other
In [src/resampling.jl](https://github.com/alan-turing-institute/MLJ.jl/blob/master/src/resampling.jl) I have defined an object for CV resampling strategy:
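mutable struct CV <: ResamplingStrategy
    n_folds::Int
    is_parallel::Bool
end
# keyword constructor; the default for `is_parallel` is an assumption
CV(; n_folds=6, is_parallel=false) = CV(n_folds, is_parallel)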
But we still need corresponding fit and evaluate methods for the Resampler{CV} model, in analogy with the Resampler{Holdout} methods. And a se method as well. (The fitresult of the fit method will be the vector of CV scores; evaluate will return the mean and se the standard deviation / sqrt(n_folds).)
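The two summaries described above could be computed along these lines. This is only a sketch: the names cv_mean and cv_se and the example scores are made up for illustration and are not part of MLJ.

```julia
using Statistics

# Hypothetical per-fold scores, as the fitresult of a CV resampler might hold them.
cv_scores = [0.71, 0.64, 0.69, 0.73, 0.66, 0.68]

# What `evaluate` would return: the mean of the fold scores.
cv_mean(scores) = mean(scores)

# What `se` would return: the sample standard deviation divided by sqrt(n_folds).
cv_se(scores) = std(scores) / sqrt(length(scores))

println("mean = ", cv_mean(cv_scores), ", se = ", cv_se(cv_scores))
```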
(v0.7) pkg> add https://github.com/alan-turing-institute/MLJ.jl/tree/master
doesn't seem to work, I get:
ERROR: failed to clone from https://github.com/alan-turing-institute/MLJ.jl/tree/master, error: GitError(Code:ERROR, Class:Net, unexpected HTTP status code: 404)
During the last call, various libraries were mentioned for boosting algorithms.
The three options we discussed (and, from a brief search, the most commonly used) are:
CatBoost, which currently has no Julia interface, although someone mentioned an interest on his (developers?) side to port it.
XGBoost, which currently has a Julia interface, with a version working for v0.7 and v1.0, alongside tests and a build.jl file to automatically build it if needed when using it. Although this version is not yet "tagged", with the new package manager this is less of an issue (and I can always get in touch with them to speed this process up if necessary).
LightGBM, which has an unmaintained Julia interface, not currently working on v0.7, with no tests present. It also requires the separate building of the dependency, although with Requires / writing our own build file this can be somewhat managed.
From an ease of use perspective it seems to be wisest to integrate/copy over XGBoost, although from a performance/feature point of view a different package might be more optimal if you have some suggestions.
Also some issues are open regarding packaging/upgrading such as #4, #7 and from what I can gather, these have now been fixed? (module registering the package itself?)
I was a bit surprised to see that the partition function does not randomise but takes the first, say, 70%. This can be problematic when (for instance) the dataset is ordered by response (e.g. the crabs dataset).
A possible way would be to just add a keyword rng which can take a (possibly seeded) RNG and shuffle on that?
using Random

# The default RNG below is an assumption; pass `rng=nothing` to disable shuffling.
function partition(rows::AbstractVector{Int}, fractions...;
                   rng::Union{Nothing, AbstractRNG}=Random.GLOBAL_RNG)
    rows = collect(rows)
    # shuffle in place unless the caller opted out
    rng === nothing || shuffle!(rng, rows)
    rowss = Vector{Vector{Int}}()
    if sum(fractions) >= 1
        throw(DomainError(sum(fractions), "fractions must sum to less than one"))
    end
    n_patterns = length(rows)
    first = 1
    for p in fractions
        n = round(Int, p*n_patterns)
        if n == 0
            @warn "A split has only one element"
            n = 1
        end
        push!(rowss, rows[first:(first + n - 1)])
        first = first + n
    end
    if first > n_patterns
        @warn "Last vector in the split has only one element."
        first = n_patterns
    end
    # whatever remains goes into the final split
    push!(rowss, rows[first:n_patterns])
    return tuple(rowss...)
end
happy to open a PR if deemed worthy.
PS: I guess an alternative is also to feed a randperm to partition.
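The randperm alternative might look like the following sketch. Here split_first is a hypothetical stand-in for a non-shuffling two-way partition, not MLJ's actual function, and the seed is arbitrary.

```julia
using Random

# Hypothetical non-shuffling split: the first `fraction` of `rows`, then the rest.
function split_first(rows::AbstractVector{Int}, fraction::Real)
    n_train = round(Int, fraction * length(rows))
    return rows[1:n_train], rows[(n_train + 1):end]
end

# Shuffle the row indices up front with a seeded RNG, then split as usual.
rng = MersenneTwister(123)
shuffled = randperm(rng, 10)
train, test = split_first(shuffled, 0.7)
```

The split itself stays deterministic; reproducibility is controlled entirely by the RNG passed to randperm.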
Nested parameter sets are currently not supported; we could borrow from JuMP.
ps = makeParamSet(
makeDiscreteParam("kernel", values=c("polydot", "rbfdot")),
makeNumericParam("C", lower=-15, upper=15, trafo=function(x) 2^x),
makeNumericParam("sigma", lower=-15, upper=15, trafo=function(x) 2^x,
requires = quote(kernel == "rbfdot")),
makeIntegerParam("degree", lower = 1, upper = 5,
requires = quote(kernel == "polydot")))
I think we should not follow the common prediction interface for probabilistic classification which returns an array of probabilities. The reason is that this does not generalize well to multi-label classification, or regression. In fact, it does not generalize at all without various hacks and "secret conventions" that make these cases quite ugly in current interface designs such as mlr or sklearn.
I might be biased, but I'd prefer that probabilistic learners return a distribution interface such as in
https://github.com/alan-turing-institute/skpro
This avoids having to think about how exactly to represent distributions in a return object and proliferating a large set of arbitrary design conventions depending on the metadata/task.
Would someone like to implement the new MLJ interface for linear models for which julia code already exists, including:
GLM.jl for many of these
Lasso.jl which needs upgrading from 0.6.
GLM
Lasso.jl
Multivariate stats
Relevant:
Integrating OnlineStats (its online learning algorithms) and giving it an easy to use hyperparameter tuning context makes Julia even more useful for quick ML on real big data.
In order to facilitate working with multiple project members we need to first:
Then individual tasks could be assigned as:
Pkg.Clone()
for now, then think about optional dependencies, when requesting model implementation packages and adding them on the fly ).
How about overloading the getindex and setindex! methods for each container we want to support to give the different containers common functionality? We can do this without interfering with the existing index methods (or, worse, wrapping the containers in a common struct) as follows:
# types for dispatch:
struct Rows end
struct Cols end
struct Names end
Base.getindex(df::AbstractDataFrame, ::Type{Rows}, r) = df[r,:]
Base.getindex(df::AbstractDataFrame, ::Type{Cols}, c) = df[c]
Base.getindex(df::AbstractDataFrame, ::Type{Names}) = names(df)
Base.getindex(df::JuliaDB.Table, ::Type{Rows}, r) = df[r]
Base.getindex(df::JuliaDB.Table, ::Type{Cols}, c) = select(df, c)
Base.getindex(df::JuliaDB.Table, ::Type{Names}) = getfields(typeof(df.columns.columns))
Base.getindex(A::AbstractMatrix, ::Type{Rows}, r) = A[r,:]
Base.getindex(A::AbstractMatrix, ::Type{Cols}, c) = A[:,c]
Base.getindex(A::AbstractMatrix, ::Type{Names}) = 1:size(A, 2)
Base.getindex(v::AbstractVector, ::Type{Rows}, r) = v[r]
Then, for example, df[Rows, 3:7] returns rows 3-7 of df whether df is a DataFrame, a JuliaDB Table, a Matrix or a vector.
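A cut-down version of the idea that runs without DataFrames or JuliaDB, covering only the matrix and vector methods quoted above, could look like this (the example data are made up):

```julia
# Types used purely for dispatch, as in the proposal above.
struct Rows end
struct Cols end

Base.getindex(A::AbstractMatrix, ::Type{Rows}, r) = A[r, :]
Base.getindex(A::AbstractMatrix, ::Type{Cols}, c) = A[:, c]
Base.getindex(v::AbstractVector, ::Type{Rows}, r) = v[r]

A = [1 2; 3 4; 5 6]
v = [10, 20, 30]

rows_of_A = A[Rows, 2:3]   # same as A[2:3, :]
col_of_A = A[Cols, 1]      # same as A[:, 1]
rows_of_v = v[Rows, 2:3]   # same as v[2:3]
```

Because Rows and Cols are new types, these methods do not change the behaviour of any existing indexing call.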
We have a Grid tuning strategy but should add a stochastic tuning strategy Stochastic <: TuningStrategy with a corresponding fit method for TunedModel{Stochastic, <:Model}. The implementer should acquaint themselves with the nested parameter API (see [src/parameters.jl] and [test/parameters.jl]). To this end, I suggest first giving the iterator(::NumericRange, resolution) and iterator(::NominalRange) methods stochastic versions, perhaps by adding a keyword argument stochastic=true.
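One way the stochastic=true keyword could behave is sketched below. SketchNumericRange and SketchNominalRange are simplified, hypothetical stand-ins for MLJ's range types, and the rng keyword is an extra assumption added so that draws can be seeded.

```julia
using Random

# Hypothetical, simplified stand-ins for MLJ's range types.
struct SketchNumericRange
    lower::Float64
    upper::Float64
end
struct SketchNominalRange{T}
    values::Vector{T}
end

# Grid by default; uniform random draws from [lower, upper] when stochastic=true.
function iterator(r::SketchNumericRange, resolution::Int;
                  stochastic=false, rng=Random.GLOBAL_RNG)
    stochastic || return collect(range(r.lower, stop=r.upper, length=resolution))
    return r.lower .+ (r.upper - r.lower) .* rand(rng, resolution)
end

# Original order by default; a random permutation when stochastic=true.
function iterator(r::SketchNominalRange; stochastic=false, rng=Random.GLOBAL_RNG)
    return stochastic ? shuffle(rng, r.values) : copy(r.values)
end

C_range = SketchNumericRange(1.0, 10.0)
kernel_range = SketchNominalRange(["polydot", "rbfdot"])
grid = iterator(C_range, 4)
draws = iterator(C_range, 100; stochastic=true, rng=MersenneTwister(1))
kernels = iterator(kernel_range; stochastic=true, rng=MersenneTwister(1))
```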
Flux provides a nice AD interface plus SGD optimisers, and this interface is being actively developed.
Looks like there's a plan to use MLMetrics for basic utilities (as given in the poster and discussed on the call). Essentially, most of the code in https://github.com/alan-turing-institute/MLJ.jl/blob/master/src/metrics.jl is a re-implementation of MLMetrics functions. It would be good to maintain uniformity and use the already implemented and tested functions.
@ablaom Need your go-ahead on this before making a PR.
Please provide any new feedback on the proposed glue-code specification
below. @fkiraly has posted some comments here. It
would be helpful also to have reactions to the two bold items below.
I will probably move the “update” instructions for the fit2 method to
model hyperparameters, leaving keyword arguments for package-specific
features (not so many use cases). It will be simplified, made into an
argument-mutating function without data as arguments. (If data really
needs to be revisited, a reference to it can be passed via cache.) The
document will explain use cases for this better.
I will require all Model field types to be concrete.
Immutable models. To improve performance, @tlienart has
recommended making models immutable. Mutable models are more
convenient because they avoid the need to implement a copy function,
and you can make a function (eg, loss) a hyperparameter (because you
don't need to copy it). The first annoyance can be dealt with (mostly)
with a macro. To deal with the second you replace a function with a
concrete type ("reference") and use type dispatch within fit to get
the function you actually want. Or something like that. In particular,
you need to know ahead of time what functions you might want to
implement. For unity, we might want to prescribe this part of the
abstraction (for common loss functions, optimisers, metrics, etc)
ourselves (or borrow from an existing library).
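The "reference type plus dispatch" idea described above might be sketched as follows. All names here (LossReference, SquaredLoss, AbsoluteLoss, SketchRegressor, total_loss) are hypothetical and chosen for illustration only.

```julia
# Concrete "reference" types standing in for loss functions.
abstract type LossReference end
struct SquaredLoss <: LossReference end
struct AbsoluteLoss <: LossReference end

# Dispatch recovers the actual function from the reference.
loss_function(::SquaredLoss) = (yhat, y) -> (yhat - y)^2
loss_function(::AbsoluteLoss) = (yhat, y) -> abs(yhat - y)

# An immutable model whose fields are all concrete, plain-data types.
struct SketchRegressor{L<:LossReference}
    lambda::Float64
    loss::L
end

# Inside something like `fit`, look the function up via dispatch.
function total_loss(model::SketchRegressor, yhat, y)
    f = loss_function(model.loss)
    return sum(f.(yhat, y))
end

model = SketchRegressor(0.1, SquaredLoss())
```

The cost noted in the text is visible here: every loss a user might want has to be given a reference type ahead of time.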
When I wrote my flux interface for Koala I found it very
convenient to use a function as a hyperparameter to generate the
desired architecture, essentially because a "model" in flux is a
function. (I suppose one could (should?) encode the architecture
a la Onnx or similar).
My vote is to keep Models mutable to make it more convenient for
package interfaces writers and because I'm guessing the performance
drawbacks are small. However, others may have a more informed opinion
than I do. For what it is worth, Scikitlearn.jl has mutable models.
What do others think about making models immutable?
Defaults for hyperparameter ranges. Is there a desire for
interfaces to prescribe a range (and scale type) for
hyperparameters, in addition to default values? (To address one
of @fkiraly's comments, default values and types of parameters
are already exposed to MLJ through the package interface's model definition.)
We have a Grid tuning strategy but should add genetic algorithm style tuning Genetic <: TuningStrategy with corresponding fit, best and predict methods for TunedModel{Genetic,<:Model}. See the related issue #37.
While looking at the GaussianProcesses case, I noticed that their method spews out floats. So for instance your training labels may be 1,1,2,1,... and the return would be 0.99, 1.11, ....
Since the inverse transform expects the same type as the input type, there's an extra step needed which I coded as:
nlevels = length(decoder.pool.levels)
pred_rc = clamp.(round.(Int, pred), 1, nlevels)
But this is a bit of a hack and it seems to me this should be addressed within the inverse_transform maybe? Or maybe I missed something that was already present.
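To make the round-and-clamp step concrete, here is a standalone sketch with made-up predictions and nlevels fixed by hand rather than read from a decoder:

```julia
# Hypothetical float outputs from a model trained on integer-coded labels 1 and 2.
pred = [0.99, 1.11, 2.7, -0.2]
nlevels = 2   # number of classes, assumed known here

# Round to the nearest integer code, then clamp into the valid range 1:nlevels.
pred_rc = clamp.(round.(Int, pred), 1, nlevels)
```

Values that round outside the pool (here 2.7 and -0.2) are pulled back to the nearest valid code, which is why the step reads as a workaround rather than a principled decoding.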
I think a macro is the easiest way to do this, given the existing learning networks API. Syntax would look something like:
composite_model = @pipeline transformer1 transformer2 predictor
The things on the right are models and the result composite_model is just another model whose hyperparameters are called "transformer1", "transformer2", "predictor" and the (mutable) values of these are set to transformer1, transformer2, predictor. Mutating these would mutate composite_model.
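The aliasing behaviour described above (mutating a component mutates the composite) can be shown without any macro. The struct names below are hypothetical, and a hand-written composite type stands in for whatever @pipeline would generate.

```julia
# Hypothetical component models: mutable containers of hyperparameters.
mutable struct SketchScaler
    center::Bool
end
mutable struct SketchEncoder
    drop_last::Bool
end
mutable struct SketchRidge
    lambda::Float64
end

# Hand-written stand-in for the composite that a macro might generate.
mutable struct SketchPipeline
    transformer1::SketchScaler
    transformer2::SketchEncoder
    predictor::SketchRidge
end

transformer1 = SketchScaler(true)
transformer2 = SketchEncoder(false)
predictor = SketchRidge(1.0)
composite_model = SketchPipeline(transformer1, transformer2, predictor)

# The composite holds references, so this change is visible through it.
predictor.lambda = 0.1
```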
Yiannes has done a great job with the code at src/interfaces/XGBoost.jl but it does not meet the model spec yet. The models may need to be split further according to "objective" function as Regressor/Classifier, Deterministic/Probabilistic, etc. And the classifiers need to be integrated with CategoricalArrays, preserving input levels, etc.
I will try to have a look at this myself soon.
Added: It is natural to break the XGBoost model into three separate models, depending on the value of the original XGBoost parameter objective:
XGBoostRegressor <: Deterministic{Any} - for reg:linear, reg:gamma, reg:tweedie (target_scitype = Continuous). MLJ objective default: objective=:linear.
XGBoostCount <: Deterministic{Any} - for count:poisson (target_scitype = Count). MLJ objective hyperparameter has :poisson as only allowed parameter value.
XGBoostClassifier <: Probabilistic{Any} - for binary:logistic, multi:softprob (target_scitype = Union{Multiclass,FiniteOrderedFactor}). MLJ objective parameter has :automatic as only allowed value.
I don't think we should implement any of the other XGBoost objective options at this time. In particular, note that reg:logistic and multi:softmax are redundant. To get these one can use the probabilistic versions and call predict_mode instead of predict. Maybe the doc string can mention this. (We do not need to implement predict_mode because there is a fall-back in MLJBase.)
Notes:
Please, let's implement and test these one at a time and organise the code the same way (i.e., don't interweave code for the three models; this makes it harder to review).
We should drop the num_class hyperparameter altogether, as we are inferring this from nlevels(y). This will avoid a lot of dancing around in clean! (The original XGBoost needs this parameter, because it has no way to know the complete pool of target values.)
Since XGBoostClassifier is probabilistic, it will predict vectors of distributions of type MLJBase.UnivariateNominal. As discussed in the guide, we will need to decode the target using decoder=CategoricalDecoder(y, Int). I suggest bundling the decoder with the fitresult to make it available to predict, and to reconstruct the labels (in the correct order) using inverse_transform(decoder, eachindex(levels(decoder))).
To reduce code redundancy, we may want to define a macro for the model struct declarations and keyword constructors. This can be done later, but with this in mind, we should keep the hyperparameter list the same across the three models, even ones that don't apply. We can use clean! to make relevant warnings.
@MikeInnes has suggested that we extract from MLJ.jl the methods to be extended by external packages and put them into a new package MLJBase.jl. So, if I want my package CoolRandomForests.jl to implement the MLJ.jl interface, I just need to import MLJBase.jl. The higher level abstractions (tasks, trainable models, learning networks, tuning and so forth) stay in MLJ.jl, which imports MLJBase.jl.
This is a fairly ubiquitous design pattern in Julia and I can think of no reason not to do this. Unless objections are raised by the end of this Friday 7th, I will go ahead and implement.
Will build on task interface design
Eg, model-agnostic Shapley values.
The poster mentions a port to work with neural networks, via Flux and Knet to be working, but I couldn't find any interface for these packages.
So far, mlj has been developed with Julia 0.6. With the release of Julia 1.0, we should try to upgrade.
There are no test suites right now, so it'll involve a lot of manual testing to see what's broken.
Please comment your findings on what needs to be changed and other thoughts on the upgrading process :)
Currently each node and machine in a learning network has a simple linear "tape" to track dependencies on machines in the network. I had in mind to replace these tapes with directed acyclic graphs, which (hopefully) makes scheduling amenable to Dagger.jl or similar.
A thorough understanding of the learning network interface at src/networks.jl will be needed. If someone has experience with scheduling, I could provide guidance, but this is probably not a small project.
(I know there's still #35 waiting to be done but the models for which there already is a ScikitLearn.jl interface are somewhat easier to write interfaces for which, hopefully, will make me confident enough to help out on more complex stuff later).
LowRankModels.jl implements quite a lot of interesting stuff:
[1] how do you envisage competing algorithms for the same task? E.g. if two packages do KMeans and implement MLJBase? Or, in fact, if two packages implement PCA like here and MultivariateStats?
I'm happy to try to implement an interface for these.
I have made some encouraging progress on how to design learning
networks and want to report this progress here. My solution is
inspired by Mike Innes' work on Flux. The idea is that you just
want to write down the math, and have the framework wrap this in the
appropriate logic under the hood. See also the post, On Machine
Learning and Programming Languages.
I will formulate my solution in terms of "dynamic data". Dynamic data
behaves superficially like regular data (e.g., a data frame) but
tracks its dependencies on other data (static and dynamic), as well as
the training events that were used to define them. You can think of
dynamic data as nodes in a learning network if you want to, but the
average user probably doesn't care.
The dynamic data type and "trainable model" type (different from the
current MLJ one) are interdependent and must be defined in just the
right way to make it all work. I think I have it now. Below is a
preview of the syntax from a working implementation (from a private
repo). I will discuss details elsewhere.
A.k.a. Learning pipelines/networks
Let's get some data (the Boston data set):
julia> using MLJ
julia> X, y = datanow(); # ALL of the data, training, test and validation
julia> # split the rows into training and testing rows:
julia> fold1, fold2 = partition(eachindex(y), 0.7) # 70:30 split
([1, 2, 3, 4, 5, 6, 7, 8, 9, 10 … 345, 346, 347, 348, 349, 350, 351, 352, 353, 354], [355, 356, 357, 358, 359, 360, 361, 362, 363, 364 … 497, 498, 499, 500, 501, 502, 503, 504, 505, 506])
julia> # construct a transformer to standardize the inputs, using the
julia> # training fold to prevent data leakage:
julia> scale_ = Standardizer()
julia> scale = prefit(scale_, X)
julia> fit!(scale, fold1)
[ Info: Training TrainableModel @ ...170.
[ Info: Done.
Note here that training is split into two phases: a prefit stage, in
which hyperparameters are wrapped in all of the data, but not told
which part (rows) of the data is for training; and a final training
stage, in which we declare which part of the data we want to use. This
is slightly more complicated than the standard approach but critical
to the dynamic approach described later.
julia> # get the transformed inputs:
julia> Xt = transform(scale, X);
julia> # convert data frame `Xt` to an array:
julia> Xa = array(Xt);
julia> # choose a learner and train it on the same fold:
julia> knn_ = KNNRegressor(K=7) # just a container for hyperparameters
julia> knn = prefit(knn_, Xa, y)
julia> fit!(knn, fold1)
[ Info: Training TrainableModel @ ...838.
[ Info: Done.
julia> # get the predictions on the other fold:
julia> yhat = predict(knn, Xa(fold2));
julia> # compute the error:
julia> er1 = rms(y(fold2), yhat)
7.32782969364479
Then we must repeat all of the above with roles of fold1 and fold2
reversed to get er2 (omitted). And then average er1 and er2 to get
our estimate of the generalization error.
We will need two lines of code not used above, but everything else
will be easier and use almost identical syntax:
julia> X = dynamic(X)
julia> y = dynamic(y)
julia> # construct a transformer to standardize the inputs:
julia> scale_ = Standardizer()
julia> scale = prefit(scale_, X) # no need to train!
julia> # get the transformed inputs, as if `scale` were already trained:
julia> Xt = transform(scale, X)
julia> # convert DataFrame Xt to an array:
julia> Xa = array(Xt)
julia> # choose a learner and make it trainable:
julia> knn_ = KNNRegressor(K=7)
julia> knn = prefit(knn_, Xa, y) # no need to train!
julia> # get the predictions, as if `knn` already trained:
julia> yhat = predict(knn, Xa)
julia> # compute the error:
julia> er = rms(y, yhat)
Now er is dynamic, so we can do "look-through" training on any rows we
like and evaluate on any rows we like. Look-through training means the scaling and KNN get refitted automatically:
julia> fit!(er, fold1)
[ Info: Training TrainableModel @ ...940.
[ Info: Done.
[ Info: Training TrainableModel @ ...251.
[ Info: Done.
julia> er1 = er(fold2)
7.32782969364479
julia> fit!(er, fold2)
[ Info: Training TrainableModel @ ...940.
[ Info: Done.
[ Info: Training TrainableModel @ ...251.
[ Info: Done.
julia> er2 = er(fold1)
9.616116727127416
julia> er = (er1 + er2)/2
8.471973210386103
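The post does not show how dynamic data is implemented, so the following is only a toy sketch of the calling convention seen in the transcript: data and derived quantities are callable on rows, and training is deferred until rows are supplied. Source, SketchStandardized and this fit! are hypothetical and far simpler than the actual private implementation.

```julia
using Statistics

# Static data wrapped so that it can be called on a set of rows.
struct Source{T}
    data::T
end
(s::Source)(rows) = s.data[rows]

# A standardiser bound to its input up front; parameters are learned later by fit!.
mutable struct SketchStandardized
    input::Source
    mu::Float64
    sigma::Float64
end
SketchStandardized(input) = SketchStandardized(input, 0.0, 1.0)

# Train on the chosen rows only.
function fit!(s::SketchStandardized, rows)
    x = s.input(rows)
    s.mu = mean(x)
    s.sigma = std(x)
    return s
end

# Evaluate on any rows, using whatever was learned in the last fit!.
(s::SketchStandardized)(rows) = (s.input(rows) .- s.mu) ./ s.sigma

x = Source([1.0, 2.0, 3.0, 4.0, 10.0, 20.0])
z = SketchStandardized(x)   # no training yet
fit!(z, 1:4)                # train on the first fold
held_out = z(5:6)           # evaluate on the other fold
```

Refitting with fit!(z, 5:6) and calling z(1:4) swaps the roles of the folds without rebuilding anything, which is the property the transcript relies on.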
With a view to adding functionality quickly, it has been proposed that we write MLJ package interfaces for some (most) scikit-learn models, ideally in a semi-automated way.
A first step to be addressed in this issue is to investigate the best way to go about doing this. We should follow a path of least resistance; performance is not an issue.
To proceed, some familiarity with the MLJ [package interface spec](https://github.com/alan-turing-institute/MLJ.jl/blob/master/doc/adding_new_models.md) will be required. This is not totally stable, but changes henceforth should be minor. An important part is the specification of model metadata, which is not explicitly exposed in the scikit-learn API and will probably need to be extracted manually, or at least in a supervised fashion. This is needed to help the user connect a prescribed task with the right models.
Two options I can think of to check out:
Also, have a look at CombineML.jl, which has interfaces to scikit-learn (.jl?) models to see how they do this.
And start to think about how the metadata might be extracted in whatever way we go with.
In response to the query about competing algorithms: Competing algorithms should be fine and they can even have the same name, which one handles by only importing the relevant packages:
import BobsTrees
import JanesTrees
model1 = BobsTrees.TreeRegressor()
model2 = JanesTrees.TreeRegressor()
However, this doesn't work with lazily loaded models (as currently implemented) because import DecisionTree (with DecisionTree being lazily loaded) imports all the DecisionTree models into the global namespace without the qualifying package name. I don't know how to fix this unwanted behaviour, except to have something ugly like:
import DecisionTree # lazily loads interface module _DecisionTree
model = _DecisionTree.TreeRegressor
I suppose an alternative to lazy loading would be to have a separate repo MLJInterfaces.jl which implements the MLJBase interface for all models that do not natively implement it (currently none do). (This is what IterableTables does for the Query TableTraits interface.) The workflow for using a model in a package that does not natively support the MLJ API is then identical to the normal one, except for one step: first import MLJInterfaces, which will load all the interfaces (and dependencies) of those packages.
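The same-name case discussed in this thread can be tried out with two toy modules defined inline. BobsTrees and JanesTrees are the made-up package names from the post, and the fields and defaults are invented for illustration.

```julia
# Two toy "packages" that each define a model with the same name.
module BobsTrees
struct TreeRegressor
    max_depth::Int
end
TreeRegressor() = TreeRegressor(5)
end

module JanesTrees
struct TreeRegressor
    min_samples_leaf::Int
end
TreeRegressor() = TreeRegressor(1)
end

# Qualifying with the module name keeps the two models apart.
model1 = BobsTrees.TreeRegressor()
model2 = JanesTrees.TreeRegressor()
```

The clash only appears if both names are brought unqualified into the same namespace, which is what the lazy-loading mechanism described above ends up doing.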
Looks like using MLJ doesn't work at the moment (from any other directory); the only way is to change directory to the mlj folder and then import it with include("MLJ.jl"). Wouldn't it be better to make this an independent package? Also, some docs might be good.
@vollmersj
I'm inviting feedback on a suggestion for encoding metadata.
We would like to associate certain metadata with models (most of these being defined in external packages). The main purpose of the metadata is so we can mimic the R task interface, which allows a user to match task specifications (e.g., I want a classifier that handles nominal features) to a list of qualifying models.
I expect a local registry will store the model metadata, with a macro call updating the registry each time a model is defined (which means when the user imports the relevant external package, in the case of lazily loaded interfaces).
We suggest metadata consist of:
predict, predict_proba, inverse_transform, etc)
Note that at present, the only subtypes of Model (our abstract type for the hyperparameter containers) are Supervised and Unsupervised; so Regression, Classification and MultiClass are just properties.
In the core code we do something like this:
abstract type Property end # subtypes are the allowable model properties
""" Models with this property perform regression """
struct Regression <: Property end
""" Models with this property perform binary classification """
struct Classification <: Property end
""" Models with this property perform binary and multiclass classification """
struct MultiClass <: Property end
""" Models with this property support nominal (categorical) features """
struct Nominal <: Property end
""" Models with this property support features of numeric type (continuous or ordered factor) """
struct Numeric <: Property end
""" Classification models with this property allow weighting of the target classes """
struct Weights <: Property end
""" Models with this property support features with missing values """
struct NAs <: Property end
And model declarations look something like this:
mutable struct DecisionTreeClassifier{T} <: Supervised{DecisionTreeClassifierFitResultType{T}}
    pruning_purity::Float64
    max_depth::Int
    min_samples_leaf::Int
    min_samples_split::Int
    min_purity_increase::Float64
    n_subfeatures::Float64
    display_depth::Int
    post_prune::Bool
    merge_purity_threshold::Float64
end
# metadata (`<:` so the methods also apply to every concrete DecisionTreeClassifier{T}):
properties(::Type{<:DecisionTreeClassifier}) = [MultiClass(), Numeric()]
operations(::Type{<:DecisionTreeClassifier}) = [predict]
type_of_X(::Type{<:DecisionTreeClassifier}) = Array{Float64,2}
type_of_y(::Type{<:DecisionTreeClassifier}) = Vector
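To illustrate how such metadata could be used to match a task to qualifying models, here is a self-contained sketch. The two model types, the registry vector and matching_models are hypothetical; the proposal above does not specify how the registry is stored or queried.

```julia
abstract type Property end
struct MultiClass <: Property end
struct Numeric <: Property end
struct Nominal <: Property end

# Hypothetical models, reduced to empty structs for the sketch.
struct SketchTreeClassifier end
struct SketchNaiveBayes end

properties(::Type{SketchTreeClassifier}) = [MultiClass(), Numeric()]
properties(::Type{SketchNaiveBayes}) = [MultiClass(), Nominal()]

# A stand-in for the local registry of known models.
registry = [SketchTreeClassifier, SketchNaiveBayes]

# Return every registered model that has all of the required properties.
function matching_models(required...)
    return [M for M in registry if all(p -> p in properties(M), required)]
end

numeric_classifiers = matching_models(MultiClass(), Numeric())
```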
In general transformers should return a table with the preferred sink type of X, where X is the table being transformed. Currently a matrix is being returned by KMeans, KMedoids and PCA. If Xout is this matrix, we should return instead MLJBase.table(Xout, prototype=X).
AFAIK, we use the n x p convention where n is the number of observations. It seems to me, however, that we should make the whole machinery able to take transposes (especially after ranting against other packages not offering this by hardcoding Matrix{T}).
A fix would be to replace all occurrences of ::Matrix{Float64} by ::AbstractMatrix{Float64}.
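A small demonstration of why the signature matters: a lazy transpose is an AbstractMatrix{Float64} but not a Matrix{Float64}. The two function names are made up for the example.

```julia
# Same body, different argument annotations.
colmeans_strict(X::Matrix{Float64}) = vec(sum(X, dims=1)) ./ size(X, 1)
colmeans_generic(X::AbstractMatrix{Float64}) = vec(sum(X, dims=1)) ./ size(X, 1)

# Data stored p x n (2 features, 3 observations) ...
Xt = [1.0 2.0 3.0; 4.0 5.0 6.0]
# ... and viewed lazily as n x p, without copying.
X = transpose(Xt)

means = colmeans_generic(X)                      # works on the lazy transpose
strict_ok = applicable(colmeans_strict, X)       # false: no method for this type
```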
Thoughts?
I think https://github.com/willtebbutt/Stheno.jl might be a more rigorous test case than GaussianProcesses.jl on which to condition MLJ's architecture.
It has a more flexible and involved modeling syntax and full posterior available for sampling/prediction. Also composable with turing.jl and flux CC: @willtebbutt
Very exciting to learn about this effort! A julia native ML package improving on Sklearn is one of the key missing pieces of the ecosystem.
Here's a list of ideas I'd like to bring to your attention, if you haven't considered them already. Some would be very long term projects, that I hope to help with, if they are even within scope. I can open issues for any that deserve their own.
There's already been a relatively developed (yet stalled) effort for something along these lines in the JuliaML ecosystem. Might want to consider integration or lifting of ideas: https://github.com/JuliaML
Integration with prob programming framework like turing.jl would be really cool. They have non gradient samplers that can work with arbitrary julia code, along with HMC. Would be cool if point parameters and priors can be mixed in a model (or model search as prob program induction) with different sampling/optimization strategies. https://github.com/TuringLang/Turing.jl cc @yebai
Regarding architecture search and "automl", here are some python exemplars: https://github.com/automl/auto-sklearn, https://github.com/jhfjhfj1/autokeras, https://github.com/EpistasisLab/tpot. tpot can optimize over non differentiable pipelines using genetic programming.
Tables.jl, by @quinnj, is an alternative table interface which much of the stats ecosystem is coordinating around and which can also hook into iterable tables, though IIRC there are some issues with missing data interop.
One of those with integration is statsmodels.jl, which has a very powerful formula interface that works with abstract tables. Would be cool (and an improvement over sklearn) to integrate with this.
I'm looking to work on some graph NN stuff, so it would be great to have support for non-Euclidean input data à la https://github.com/rusty1s/pytorch_geometric . This is both bringing NNs to graphs and bringing graphs to NNs as useful "inductive biases": https://arxiv.org/abs/1806.01261
Yellowbrick type Plots.jl or Makie recipes
Currently, target_kind can be :nominal or :numeric, with :numeric including ordered categoricals. These last must be represented as integers by the user but predictions can be continuous (or pdf's with continuous support). Is this satisfactory, or do we want to formally separate out the ordered categorical case?
The new proposal could be:
(i) replace [:nominal, :numeric] options with [:factor, :ordered_factor, :continuous]
(ii) implementers of the interface for an algorithm with :ordered_factor targets can expect the target(s) to be ordered CategoricalVector's for case of finite number of classes and Vector{<:Integer} for the infinite case.
(iii) in infinite class case, predict can output a continuous value (float) if available. In finite case a discrete prediction is required.
Data about wrapped learner need to be present even if Module is not installed
For starters: Add a tool to have models compete, based on paired cv scores?
Would be good to have some flux integration
Regarding unsupervised models such as PCA, kmeans, etc discussed in #44.
I know these are commonly encapsulated within the transformer formalism, but it would do the methodology behind them an injustice, as feature extraction is only one of the major use cases of unsupervised models. More precisely, there are, as far as I can see, three use cases:
(i) feature extraction. For clusterers, create a column with cluster assignment. For continuous dimension reducers, create multiple continuous columns.
(ii) model structure inference - essentially, inspection of the fitted parameters. E.g., PCA components and loadings. Cluster separation metrics etc. These may be of interest in isolation, or used as an (hyper-parameter) input of other atomic models in a learning pipeline.
(iii) full probabilistic modelling aka density estimation. This behaves as a probabilistic multivariate regressor/classifier on the input variables.
For a start it makes sense to implement only "transformer" functionality, but it is maybe good to keep in mind for implementation that eventually one may like to expose the other outputs via interfaces. E.g., the estimated multivariate density in a fully probabilistic implementation of k-means.