JuliaML / LossFunctions.jl
Julia package of loss functions for machine learning.
Home Page: https://juliaml.github.io/LossFunctions.jl/stable
License: Other
So this is more of a discussion. I naively thought it might be nice to make use of the new broadcast syntax (e.g. sin.([1,2,3])) for our loss functions.
It is not hard to make this work. We would just have to define some dummy methods. That said, our manual implementation seems superior at the moment.
Base.getindex(l::Loss, idx) = l
Base.size(::Loss) = (1,)
After compiling this is the result
julia> @time value.(L2DistLoss(), [11,-3,7,14], [1,2,3,4])
0.000522 seconds (108 allocations: 3.359 KB)
4-element Array{Int64,1}:
100
25
16
100
As comparison this is our manual implementation
julia> @time value(L2DistLoss(), [11,-3,7,14], [1,2,3,4])
0.000014 seconds (7 allocations: 496 bytes)
4-element Array{Int64,1}:
100
25
16
100
Thoughts?
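For comparison, later Julia versions provide a dedicated hook for this use case: a type can ask to be treated as a scalar under broadcasting by defining Base.broadcastable. The following is only a minimal sketch under that assumption; SquaredDistLoss is a hypothetical stand-in, not the package's L2DistLoss, and no timing claims are made.

```julia
# Hypothetical stand-in for a loss type (not the package's definition).
struct SquaredDistLoss end

# Scalar method: loss of a single target/output pair.
value(::SquaredDistLoss, target::Number, output::Number) = abs2(output - target)

# Treat the loss object as a scalar when it appears in a broadcast call.
Base.broadcastable(l::SquaredDistLoss) = Ref(l)

# Same inputs as in the timing example above.
losses = value.(SquaredDistLoss(), [11, -3, 7, 14], [1, 2, 3, 4])
```

With this definition no getindex or size methods are needed on the loss type.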
What are they, and what functionality and traits do they imply?
Here's a list to start discussion:
fit:
fit!:
transform/predict:
evaluate
Others?
Notes:
const predict = transform as a convenience is ok in my eyes
I like that you've kicked this off with real code, but I hope that you're still open to re-thinking everything from the ground up. This may involve changing or replacing code that you already have, but of course there will be a good reason for any changes.
With that said, I think the first important step is to define the scope of the problem as best we can:
Like I said in the roadmap discussion, I think defining a type hierarchy at the beginning is just asking for failure. Some things will require types, sure, but I do feel like premature-typing tends to ruin some of Julia's strengths. This is not an object oriented language, and building an object oriented framework will limit the power of LearnBase.
Let's compile a full scope of the problem first, and then we can start to discuss design specifics after.
I'm not sure if this belongs here, or maybe MLDataUtils.jl? I want to develop an abstraction surrounding datasets, likely similar to: https://github.com/tbreloff/OnlineAI.jl/blob/dev/src/nnet/data.jl#L136
The concept is that there are many types of data sets, and many ways to access that data. It would be nice if the models/solvers did not need to know about any of this, and we could reduce much of the interface to simple iteration.
Desired API:
# given input/target data in some format, get n random samples from the dataset
for (input, target) in RandomSampler(data, n)
    output = fit!(model, input, target)
    # do something with output?
end
# k-fold cross validation.. partition the dataset (without data copies!)
for (train, validate) in CrossValidation(data, k)
    for (input, target) in RandomSampler(train, n)
        # fit, etc
    end
    # access each pair once, sequentially
    for (input, target) in DataIterator(validate)
        # do something with the validation set
    end
end
I think you can get the idea. This would be a nice pattern to have available, partly because it could handle many different types of incoming data, and partly because it cleanly separates data from algorithms.
Thoughts?
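To make the proposed pattern a bit more concrete, here is a rough sketch of how a RandomSampler could plug into Julia's iteration protocol. Everything in it is an assumption for illustration: the field names, the seed keyword, and the convention that observations are stored as columns are not taken from any existing package.

```julia
using Random

# Hypothetical sampler: yields `n` randomly chosen (input, target) pairs.
struct RandomSampler{X<:AbstractMatrix,Y<:AbstractVector}
    inputs::X              # assumed layout: one observation per column
    targets::Y
    n::Int
    rng::MersenneTwister
end

# Convenience constructor with a fixed seed for reproducibility.
RandomSampler(inputs, targets, n; seed = 1) =
    RandomSampler(inputs, targets, n, MersenneTwister(seed))

Base.length(s::RandomSampler) = s.n

# Iteration protocol: the state is just the number of samples drawn so far.
function Base.iterate(s::RandomSampler, count = 0)
    count >= s.n && return nothing
    i = rand(s.rng, 1:length(s.targets))
    # `view` avoids copying the observation
    return (view(s.inputs, :, i), s.targets[i]), count + 1
end

X = [1.0 2.0 3.0; 4.0 5.0 6.0]   # 2 features, 3 observations
y = [10.0, 20.0, 30.0]
seen = [(collect(input), target) for (input, target) in RandomSampler(X, y, 5)]
```

A model or solver that consumes such an iterator only ever sees (input, target) pairs, which is the separation of data from algorithms described above.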
I was adding a QuantileLoss type when I noticed this. In statistics, residual = y - yhat. Our distance losses are implemented as if residual = yhat - y, so derivatives are missing a minus sign.
# L(y, yhat) = abs(y - yhat)
value(loss::L1DistLoss, residual) = abs(residual)
deriv(loss::L1DistLoss, residual) = sign(residual) # missing a minus sign
There are several places in the code with h(x) - y (plot recipes, docs). Are there reasons for this, or can we switch to statistical notation?
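The sign issue can be checked numerically. This is a small sketch, using hypothetical helper names, that compares the chain-rule derivative with respect to yhat under the statistical convention residual = y - yhat against a central finite difference.

```julia
# L1 distance loss written in terms of the target y and the prediction yhat.
l1(y, yhat) = abs(y - yhat)

# Chain rule with residual = y - yhat: d(residual)/d(yhat) = -1,
# hence the extra minus sign in front of sign(residual).
dl1_dyhat(y, yhat) = -sign(y - yhat)

# Central finite difference as an independent check.
y, yhat, h = 3.0, 1.0, 1e-6
fd = (l1(y, yhat + h) - l1(y, yhat - h)) / (2h)
```

Here the prediction is too small, so increasing yhat reduces the loss and the derivative is negative.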
@tbreloff I started by using a few things that I have already implemented. Why don't you establish the org and we move this there and we iterate on it.
Near plans:
EDIT: plans are discussed and iterated on in other issues
Just had a nice lunch with @tbreloff @joshday and @jmxpearson where we discussed having a stand-alone package for online optimization (stochastic gradient descent, ADAM, etc.). I hadn't realized so much had been implemented already in LearnBase. I'm opening this mostly to start a discussion where I can get caught up to speed before I go off and spend time re-implementing the work done here. This issue of course dovetails off previous conversations, but I felt it more appropriate to start a new thread.
First off, great work -- I really like the motivation for this package and it is beautifully organized. But I'm a bit confused as to what the scope of this project is and how I can contribute and work synergistically. Does this package just specify an interface or set of functions that researchers will import and extend in their own packages? Or will the actual code for specifying and optimizing various models all live here?
Specific comments/questions:
Suppose I'm working on a dataset with n observations of p measured features. For simplicity let's say all measurements are real-valued and I want to use L2DistLoss. However, each feature has a different variance, which could happen for a variety of reasons.
As a result, I want to scale the loss of each feature so that errors in the noisier features are penalized less. For example, if p = 2 I might calculate the full objective function like this:
for n = 1:nobs
    # penalty for first feature with scale 1.0
    objective += 1.0*value(L2DistLoss(), actual[n,1] - prediction[n,1])
    # penalty for second feature with scale 2.0
    objective += 2.0*value(L2DistLoss(), actual[n,2] - prediction[n,2])
end
My question is whether it would be useful to define a new type called something like ScaledLoss so that the calculation becomes:
for n = 1:nobs
    # penalty for first feature with scale 1.0
    objective += value(ScaledLoss{L2DistLoss,1.0}(), actual[n,1] - prediction[n,1])
    # penalty for second feature with scale 2.0
    objective += value(ScaledLoss{L2DistLoss,2.0}(), actual[n,2] - prediction[n,2])
end
The advantage being that this scaling can be passed through when calculating gradients, etc.
Note: these loops are really just for demonstration. I wouldn't actually calculate the objective this way, but I think it illustrates the point.
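As a rough illustration of the idea, a ScaledLoss with the scale stored as a type parameter could look like the sketch below. The type names mirror the ones in the text, but the definitions are hypothetical and written in current Julia syntax; they are not the package's implementation.

```julia
abstract type DistanceLoss end

# Hypothetical minimal L2 loss as a function of the residual.
struct L2DistLoss <: DistanceLoss end
value(::L2DistLoss, r::Number) = abs2(r)
deriv(::L2DistLoss, r::Number) = 2r

# Decorator: the scale factor K is part of the type.
struct ScaledLoss{L<:DistanceLoss,K} <: DistanceLoss
    loss::L
end
ScaledLoss{L,K}() where {L<:DistanceLoss,K} = ScaledLoss{L,K}(L())

# The scale is passed through to both the value and the derivative.
value(s::ScaledLoss{L,K}, r::Number) where {L<:DistanceLoss,K} = K * value(s.loss, r)
deriv(s::ScaledLoss{L,K}, r::Number) where {L<:DistanceLoss,K} = K * deriv(s.loss, r)

# Per-feature scales for a toy problem with 2 observations and p = 2.
actual     = [1.0 2.0; 3.0 4.0]
prediction = [0.0 2.5; 3.0 3.0]
scaled = (ScaledLoss{L2DistLoss,1.0}(), ScaledLoss{L2DistLoss,2.0}())
objective = sum(value(scaled[j], actual[n, j] - prediction[n, j]) for n in 1:2, j in 1:2)
```

Because deriv forwards the same factor, gradient code does not need to know which features were rescaled.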
We have a rudimentary documentation in place, but it could be a lot better.
So... I think LearnBase is ready to be registered. It's not complete, but what package is? It's got a ton of content, and I'm using it as the basis for a couple packages (OnlineAI is one). What do you think? Should we start a checklist of things to finish before we register?
This issue continues the discussion started at #4 concerning loss functions.
I thought a lot about our discussions so far and did a lot of targeted reading on the subject. Since the author of EmpiricalRisks seems very busy currently I am starting to agree that we should at least for now do our own Loss functions (we can always merge efforts later). I really want to keep the momentum going.
I suggest we simply go ahead and establish a package that is only concerned with loss functions but does those well. It should perfectly serve all our current usecases at least.
In fact, I made the first steps and outlined quite some code at LossFunctions.jl. The implementation is heavily based on the in-depth treatment of loss functions by Ingo Steinwart et al.
This is what I did so far:
Essentially I'd like to keep the type hierarchy small, but the following abstract tree seems essential to me. Note how this allows dispatching on whether the loss is a classifier or not (which I need for SVMs).
Cost,
Loss,
SupervisedLoss,
MarginBasedLoss,
DistanceBasedLoss,
UnsupervisedLoss,
Furthermore I have included quite a collection of verbs from the literature that seem useful. Not all Losses have to support all verbs
value,
deriv,
deriv2,
value_deriv,
value_fun,
deriv_fun,
deriv2_fun,
value_deriv_fun,
representing_fun,
representing_deriv_fun,
isminimizable,
isdifferentiable,
isconvex,
isnemitski,
isunivfishercons, # useful for multivariate
isfishercons,
islipschitzcont,
islocallylipschitzcont,
isclipable,
ismarginbased,
isclasscalibrated,
isdistancebased,
issymmetric
Concerning the losses themselves, I actually want to provide two ways to interact with the most common ones.
myloss = LogitLoss()
myloss(x) # for margin- and distance- based
myloss'(x) # for margin- and distance- based
value(myloss, ...)
deriv(myloss, ...)
deriv2(myloss, ...)
value_deriv(myloss, ...)
f = value_fun(myloss) # also possible
g = deriv_fun(myloss)
logit_loss(...)
logit_deriv(...)
logit_deriv2(...)
logit_value_deriv(...)
For now I have just implemented LogitLoss, L1Loss and L2Loss as examples. But once we come to some agreement I would happily implement any loss that I can think of.
What do you think? Any suggestions or criticism?
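A sketch of how the object-style interface above might be wired up in current Julia follows. It assumes the logistic loss log(1 + exp(-agreement)) as the formula behind LogitLoss and uses Base.adjoint to support the myloss'(x) notation; both choices are illustrative assumptions, not the package's actual code.

```julia
# Hypothetical margin-based logistic loss.
struct LogitLoss end

value(::LogitLoss, agreement::Number) = log1p(exp(-agreement))
deriv(::LogitLoss, agreement::Number) = -1 / (1 + exp(agreement))

# myloss(x): make instances callable.
(l::LogitLoss)(agreement::Number) = value(l, agreement)

# myloss'(x): the postfix ' calls adjoint, which here returns the derivative function.
Base.adjoint(l::LogitLoss) = agreement -> deriv(l, agreement)

# Closures for callers that want plain functions.
value_fun(l::LogitLoss) = agreement -> value(l, agreement)
deriv_fun(l::LogitLoss) = agreement -> deriv(l, agreement)

myloss = LogitLoss()
f = value_fun(myloss)
g = deriv_fun(myloss)
```

All variants share the same two scalar methods, so adding a new loss only requires defining value and deriv once.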
A generalization of #48 (and as such a substitution of it) for any margin-based loss. It should be realized in the form of a decorator (similar to what a ScaledLoss is).
Let L_a be the weighted version of L with weight parameter a:
L_a(y, yhat) = if y == 1
    (1 - a)*L(yhat)
elseif y == -1
    a*L(-yhat)
end
EDIT: while this is a decorator for any margin-loss, the resulting Loss itself is not margin-based
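The definition of L_a translates fairly directly into a decorator type. The sketch below uses a hypothetical WeightedBinaryLoss wrapper around a hypothetical HingeLoss; the names and the error raised for invalid targets are assumptions.

```julia
# Hypothetical margin-based loss as a function of the agreement.
struct HingeLoss end
value(::HingeLoss, agreement::Number) = max(zero(agreement), 1 - agreement)

# Decorator holding the wrapped loss and the weight parameter a.
struct WeightedBinaryLoss{L}
    loss::L
    a::Float64
end

# Needs the target itself, not only the agreement y*yhat,
# which is why the result is no longer margin-based.
function value(w::WeightedBinaryLoss, y::Number, yhat::Number)
    if y == 1
        (1 - w.a) * value(w.loss, yhat)
    elseif y == -1
        w.a * value(w.loss, -yhat)
    else
        throw(ArgumentError("target must be +1 or -1"))
    end
end

w = WeightedBinaryLoss(HingeLoss(), 0.25)
```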
I was thinking. The main reason I had for including UnicodePlots was for the little example plots in show. Well, maybe the better place would be in the docstring for each loss function, which are predefined strings as we know. For example the result for ?HingeLoss would include a little precomputed UnicodePlots plot at the bottom showing what the loss function and derivative look like.
https://github.com/JuliaComputing/ArrayFire.jl
Worth thinking about given that it has a CPU back-end that should work on travis. In the end we would like to run these things on the GPU.
Thoughts?
What functionality should be accessible? How should we access it? What is core? What exists already in a usable state? (see StatsBase.jl, MLBase.jl, MachineLearning.jl, ...) What should be wrapped/linked? What should be left unimplemented, waiting for 3rd party extension?
This is not a complete list... just a placeholder which we should add to.
Concepts:
Models/algorithms:
Another margin-based loss function to add to our collection is the exponential loss L(agreement) = exp(-agreement) used in AdaBoost.
One bridge we have to cross sooner or later is multiclass problems. There are multiclass extensions or formulations for a couple of the losses that we have.
A particularly interesting example is the hinge loss. In a multinomial setting the targets could be indices, such as [1,2,3,1], which would violate our current idea of a targetdomain being [-1,1].
One solution could be to think of the multivariate version as separate from the binary one. For example we could lift it like this: Multivariate{L1HingeLoss}. This could then have its own targetdomain and other properties. It would also avoid potential ambiguities when it comes to dispatching on the types of targets and output. I am not sure we could be certain of dealing with a multivariate vs binary case just based on the parameter types.
Again, for completeness' sake: as the name suggests, this is a weighted version of the classification loss (zero-one loss) which penalizes errors differently according to a hyperparameter. (see p. 24, example 2.5 in Steinwart 2008)
Similar to how value, deriv, etc. are a function of the distance (residual) for DistanceLoss, how about changing MarginLoss to AgreementLoss, as those methods are a function of the agreement?
We currently have somewhat of an inconsistency in the types of the member variables of losses.
Some are flexible, but some are hardcoded to Float64 for little reason. We should make them flexible as well.
DWDMarginLoss
It's not critical right now, but good to keep an eye on. Call overloading for abstract types has apparently gone away, which is causing travis failures:
WARNING: deprecated syntax "call(c::PredictionLoss, ...)".
Use "(c::PredictionLoss)(...)" instead.
ERROR: LoadError: LoadError: LoadError: LoadError: cannot add methods to an abstract type
in include(Core.#include, UTF8String) at /Users/travis/julia/lib/julia/sys.dylib:-1
in include_from_node1(Base.#include_from_node1, ASCIIString) at /Users/travis/julia/lib/julia/sys.dylib:-1
in include(Core.#include, UTF8String) at /Users/travis/julia/lib/julia/sys.dylib:-1
in include_from_node1(Base.#include_from_node1, ASCIIString) at /Users/travis/julia/lib/julia/sys.dylib:-1
in include(Core.#include, ASCIIString) at /Users/travis/julia/lib/julia/sys.dylib:-1
in include_from_node1(Base.#include_from_node1, ASCIIString) at /Users/travis/julia/lib/julia/sys.dylib:-1
in eval(Core.#eval, Module, Any) at /Users/travis/julia/lib/julia/sys.dylib:-1
in require(Base.#require, Symbol) at /Users/travis/julia/lib/julia/sys.dylib:-1
in include(Core.#include, ASCIIString) at /Users/travis/julia/lib/julia/sys.dylib:-1
in include_from_node1(Base.#include_from_node1, UTF8String) at /Users/travis/julia/lib/julia/sys.dylib:-1
in process_options(Base.#process_options, Base.JLOptions) at /Users/travis/julia/lib/julia/sys.dylib:-1
in _start(Base.#_start) at /Users/travis/julia/lib/julia/sys.dylib:-1
while loading /Users/travis/.julia/v0.5/LearnBase/src/loss/abstract.jl, in expression starting on line 5
while loading /Users/travis/.julia/v0.5/LearnBase/src/loss/loss.jl, in expression starting on line 2
while loading /Users/travis/.julia/v0.5/LearnBase/src/LearnBase.jl, in expression starting on line 64
while loading /Users/travis/.julia/v0.5/LearnBase/test/runtests.jl, in expression starting on line 1
The relevant issue: JuliaLang/julia#14919
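One possible workaround, sketched here in current Julia syntax, is to generate the call method for each concrete subtype instead of attaching it to the abstract type. The loss types below are hypothetical stand-ins and the loop is only an illustration of the approach, not what the package ended up doing.

```julia
abstract type PredictionLoss end

# Hypothetical concrete losses.
struct L1Loss <: PredictionLoss end
struct L2Loss <: PredictionLoss end

value(::L1Loss, r::Number) = abs(r)
value(::L2Loss, r::Number) = abs2(r)

# Define the call overload once per concrete type.
for T in (L1Loss, L2Loss)
    @eval (l::$T)(r::Number) = value(l, r)
end
```

The drawback is that every new concrete loss has to be added to the loop (or defined through a helper macro).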
I'm starting this as a catch-all for discussion about concepts that heavily overlap between disciplines. I had started to contemplate how to merge the implementation that you're developing here with the concepts/requirements that I see in OnlineStats/OnlineAI.
(sorry... this is a little rambling...)
Similar:
A link function is effectively of the form: f(y) = x'w + b
or y = f'(x'w + b)
An activation function is effectively: y = f(x'w + b)
... the name is more a holdover from the perceptron's discontinuous {0,1} classification, but over time has seemed to blend more with the concept of a link function. So we see a "sigmoid activation" is effectively the same as the "inverse logit link" in the GLM sense. It's valuable to think of a neural net with no hidden layers to be conceptually similar to a GLM... we are transforming the inner product of inputs and weights (plus bias), with the goal of maximizing the likelihood that w/b generate the relationship.
Now, neural nets are built from connected layers of these GLMs, which just happen to have solvers that exploit the known structure to provide better "measures of fit" for the earlier layers. The likelihood is in relation to the whole of the structure, as opposed to just a single function of f(x'w + b).
Let's jump ahead and think of how we can apply a "layering" concept to other areas. You mentioned in another issue that you would like to do something like:
cs = fit(CenterScale(), Xtrain)
Xtrain2 = predict(cs, Xtrain)
pca = fit(PCA(), Xtrain2)
Xtrain3 = predict(pca, Xtrain2)
myfit = fit(MyModel(), Xtrain3)
yhat = predict(myfit, Xtrain3)
In some sense this is an extremely specific 3-layer neural net, and I think some abstractions can be seen. Assume we have 3 input variables, and 1 target. Imagine instead:
y1 = f1(x'w1 + b1), with w1 the identity matrix, and f1 the identity function.
y2 = f2(y1'w2 + b2), with w2 as the principal components loading matrix, f2 the identity function.
y3 = f3(y2, w3, b3)
Thinking about things like this I think will open your mind to some alternative ways to think about modeling pipelines... specifically why are you focused on de-meaning the inputs, or reducing dimensions using PCA, if your final goal is really to have some larger function which maps initial inputs to final target values. The concept behind PLS is very similar... instead of 2 independent steps of "reduce dimension, then fit", it combines into a unified step "reduce dimensions and fit". This is how neural nets work as well, and backprop just gives a method to fit the weights of earlier models using the chain rule and intermediate gradients.
There are also parallels to a lot of the deep learning approaches... namely the use of Restricted Boltzmann Machines and Stacked Autoencoders (both have heavy parallels with PCA and dimensionality reduction) as inputs into a simpler predictive model. There has been good success in using these models/algorithms simply as weight initialization in a larger system, where those models are effectively re-fit using the final target as driving the weight updates, here with a heavy parallel to PLS.
The point I'm trying to get across with all these tangents is that I think it can be very helpful when designing, to think about how simpler models (pca, svm, glm, ...) could be reinterpreted in terms of a more complicated framework. With the right abstractions, we could recreate something like PLS automatically by plugging the generalized PCA with generalized logistic regression.
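The three-layer reading of the CenterScale, PCA, model pipeline can be sketched with a single generic layer type. The Affine type, its column-vector convention (W' * x rather than x'w), and all numbers below are made up for illustration; the "loading matrix" simply keeps the first two coordinates and is not a fitted PCA.

```julia
using LinearAlgebra

# Hypothetical layer: y = f(W'x + b), applied to one column vector x.
struct Affine{F}
    W::Matrix{Float64}
    b::Vector{Float64}
    f::F
end
(l::Affine)(x::AbstractVector) = l.f.(l.W' * x .+ l.b)

# Layer 1: centering, i.e. identity weights and bias = -mean.
center = Affine(Matrix{Float64}(I, 3, 3), -[1.0, 2.0, 3.0], identity)
# Layer 2: projection onto 2 components (stand-in for a PCA loading matrix).
project = Affine(reshape([1.0, 0.0, 0.0, 0.0, 1.0, 0.0], 3, 2), zeros(2), identity)
# Layer 3: a linear model on the reduced inputs.
model = Affine(reshape([2.0, -1.0], 2, 1), [0.5], identity)

# The whole pipeline is just function composition.
pipeline = model ∘ project ∘ center
yhat = pipeline([2.0, 4.0, 10.0])
```

Seen this way, fitting the stages jointly instead of one after the other only requires that each layer can also propagate gradients backward.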
@Evizero I'm working on a new recurrent neural net framework within OnlineAI, and I'm considering whether I can switch the loss models to those in LearnBase. I see you haven't pushed any updates in a while... what is your latest thinking on this repo? There are a few minor aspects that I would consider changing (remove UnicodePlots dependency, add "Abstract" to some of the type names, etc), as well as some conventions (multiplying by 0.5 for L2 loss, for example)... but I expect most of this is good as-is. What's your opinion on the current state of the package, and would you be happy for me to submit PRs to change some of this stuff to my liking?
Right now the README is in a very old and not very useful state
It should include description for
At the moment the minimum of the PoissonLoss is non-zero. This does not change the answer to whatever optimization problem you are solving, so I am inclined to leave it as is for now.
If we decided we want the convention to be zero loss for perfect fit, we'd have to make the following change:
# current:
value(::PoissonLoss, t::Number, o::Number) = exp(o) - t*o
# add a shift:
value(::PoissonLoss, t::Number, o::Number) = exp(o) - t*o + t*log(t) - t
This convention may have some advantages, but I'd have to think carefully about it. A downside is that it adds potentially unnecessary computations.
Corollary: If we don't do the re-shifting, it might be nice to add some functions:
minimum(::PoissonLoss, t::Number) = t - t*log(t) # best achievable value
argmin(::PoissonLoss, t::Number) = log(t) # value of output to achieve min
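A quick numerical check of these quantities, using hypothetical free functions instead of methods on PoissonLoss: setting the derivative exp(o) - t to zero gives o = log(t), where the unshifted loss equals t - t*log(t), and adding t*log(t) - t moves that minimum to zero.

```julia
# Unshifted Poisson loss for target t and output o.
poisson_value(t, o) = exp(o) - t * o

# Minimizer and minimum value, from d/do (exp(o) - t*o) = exp(o) - t = 0.
poisson_argmin(t) = log(t)
poisson_minimum(t) = t - t * log(t)

# Shifted variant with zero loss for a perfect fit.
poisson_shifted(t, o) = poisson_value(t, o) + t * log(t) - t

t = 3.0
best = poisson_value(t, poisson_argmin(t))
```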
I noticed the misspelling in some doc strings.
Weighted losses seem to be more general than the loss functions considered here. I personally want this so that I can also specify time-weighted regressions like the two-stage method here using the same infrastructure as LossFunctions.jl. Other places where this comes up are things like LOESS. It would be a very small expansion to the API to allow passing a weight vector (optional, defaulting to ones) and weight function, but it seems like it could have many uses.
I think there are a lot of issues for handling sparse data. The main one I want to highlight is that for sparse binary data we may want to rethink the {+1,-1} encoding convention. Consider applications where you are trying to detect sparse, binary events. Then I think it makes sense for the target to be a sparse vector and for us to interpret the zeros as -1s.
using LossFunctions
t = [-1,1.]
t_sparse = sparse([0,1.])
x = randn(2);
value(LogitMarginLoss(),t,x)
2-element Array{Float64,1}:
0.0806863
0.0330961
value(LogitMarginLoss(),t_sparse,x)
2-element Array{Float64,1}:
0.0454645
0.0330961
There are a variety of efficiencies we could implement in this case as well.
It would maybe be nice to describe the implemented properties. For example the description for isconvex could state a short formal or informal definition for what makes a loss convex, etc.
note to self: link to places. add reference in contributing issue
right now there is still this placeholder code https://github.com/Evizero/LearnBase.jl/blob/master/src/optim/optim.jl#L6-L26 I was working on. I think the current best way would be to utilize Optim.jl using requires
I remember the question being raised why the ZeroOneLoss is not a subtype of MarginLoss (although I can't find the associated issue).
I investigated a bit to retrace my steps. The book I have been using as a guide states the classification Loss is not margin-based, yet our implementation seems to fit the definition. I found out that the reason for this is that I simply did not implement the loss as it is described in the book (which is L(y,t) = 1 (y sign t), where the 1 is an indicator function). The book version clearly isn't margin based, since it can't be written as L(y*t).
At some point I must have been compelled to try and formulate the Loss as a margin-based one, but then forgot to reflect that in the type tree.
tl;dr: Our implementation is margin-based. will fix.
I just noticed grad_fun and value_grad_fun are exported but not defined anywhere.
I like the idea of having a standard bank of loss functions and parameter penalties. One thing that would be very useful to compute for all instances of this would be proximal mappings. These form the basis for a large class of optimization algorithms for non-smooth functions (authoritative review here: http://stanford.edu/~boyd/papers/prox_algs.html).
The definition of the proximal operator/mapping is:
prox(f,x0,rho) = argmin_x f(x) + (0.5/rho)*norm(x-x0)^2
Where f is the function (typically a penalty or loss), x0 is the current parameter guess, and rho tunes the step size. If rho is small (near zero) then the parameter update (x - x0) is in the direction of the gradient -- i.e. the proximal mapping performs gradient descent (albeit with small step sizes). See the review above for more intuitions.
To start, I would propose two new functions prox and prox! for loss and penalty functions. For the L1 penalty:
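# soft-thresholding applied elementwise; returns a new array
function prox(r::L1Penalty, x0::Array, ρ::Float64)
    return max.(0, x0 .- ρ) .- max.(0, -x0 .- ρ)
end
# in-place variant: overwrites x0 and returns it
function prox!(r::L1Penalty, x0::Array, ρ::Float64)
    for i = 1:length(x0)
        x0[i] = max(0, x0[i] - ρ) - max(0, -x0[i] - ρ)
    end
    return x0
end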
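The closed form used above (soft-thresholding) can be checked against the definition of the proximal operator by brute force. This is a self-contained sketch with hypothetical helper names; it minimizes abs(x) + (0.5/rho)*(x - x0)^2 over a fine grid and compares the result with the formula.

```julia
# Closed-form prox of the absolute value (soft-thresholding) for one coordinate.
soft_threshold(x0::Real, ρ::Real) = max(0, x0 - ρ) - max(0, -x0 - ρ)
prox_l1(x0::AbstractArray, ρ::Real) = soft_threshold.(x0, ρ)

# Objective from the definition: f(x) + (0.5/ρ) * (x - x0)^2 with f = abs.
prox_objective(x, x0, ρ) = abs(x) + (0.5 / ρ) * (x - x0)^2

# Brute-force minimizer over a grid, used only as a sanity check.
grid = -3:0.001:3
brute_prox(x0, ρ) = grid[argmin([prox_objective(x, x0, ρ) for x in grid])]

shrunk = prox_l1([2.0, -0.3, 0.5], 0.5)
```

Entries whose magnitude is at most rho are set to zero, and the others are moved toward zero by rho.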
I have some rough code here, which I am happy to port over: https://github.com/ahwillia/ProxAlgs.jl
I also have some optimization routines implemented in that package, like ADMM. I could port those over as well, but as I brought up in #22 -- I'm still a bit conflicted over whether we should be fleshing out full optimization routines in this package.
Corollary 1: The grad and grad! functions are a bit tough for me to parse just by perusing the source code. Is there a reason for these to be so closely tied to risk models? I think it makes sense for prox and prox! to mirror how we calculate gradients.
Corollary 2: Have we thought about how to represent objective functions with multiple penalties? For example, I have to implement different prox operators for L1, L2, and ElasticNet, even though elastic net is just a linear combination of L1 and L2 penalties. In other words, prox(f+g,x0,r) != prox(f,x0,r) + prox(g,x0,r).
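A small scalar example makes the non-additivity concrete. The sketch assumes the parametrization f(x) = λ1*abs(x) and g(x) = (λ2/2)*x^2 for the two parts of the elastic net; under that assumption the prox of the sum is a soft-threshold followed by a shrinkage, which differs from the sum of the individual prox results.

```julia
soft_threshold(x0, τ) = max(0, x0 - τ) - max(0, -x0 - τ)

# prox of f(x) = λ1*|x| with step size ρ
prox_l1(x0, ρ, λ1) = soft_threshold(x0, ρ * λ1)
# prox of g(x) = (λ2/2)*x^2 with step size ρ
prox_l2(x0, ρ, λ2) = x0 / (1 + ρ * λ2)
# prox of f + g: threshold first, then shrink
prox_elasticnet(x0, ρ, λ1, λ2) = soft_threshold(x0, ρ * λ1) / (1 + ρ * λ2)

x0, ρ, λ1, λ2 = 2.0, 1.0, 1.0, 1.0
combined = prox_elasticnet(x0, ρ, λ1, λ2)
naive_sum = prox_l1(x0, ρ, λ1) + prox_l2(x0, ρ, λ2)
```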
DWD is a differentiable generalization of L1HingeLoss (but different from SmoothedL1HingeLoss). Is DWDMarginLoss a good enough name?
See page 11. https://arxiv.org/pdf/1508.05913.pdf
I'll abuse this issue to store screenshots used in documentation and readme. Not the cleanest solution there ever was or will be, but certainly convenient 🐢
I just realized that I named the method wrong in the beginning. There are subtle differences between being strictly convex and strongly convex, and right now I marked all the losses as strongly convex when I actually should have marked them as strictly convex.
I don't think the logistic margin loss is strongly convex for example, while it is strictly convex.
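The distinction can be illustrated with the second derivative of the logistic margin loss, assuming the form log(1 + exp(-agreement)). Its curvature is positive everywhere, which is consistent with strict convexity, but it decays toward zero for large agreement, so there is no constant m > 0 bounding it from below as strong convexity would require. The helper names are hypothetical.

```julia
sigmoid(a) = 1 / (1 + exp(-a))

# Second derivative of log(1 + exp(-a)) with respect to a.
logit_deriv2(a) = sigmoid(a) * (1 - sigmoid(a))

# Curvature at increasing agreement values.
curvatures = [logit_deriv2(a) for a in (0.0, 5.0, 10.0, 20.0)]
```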
I think this is wrong. I just did a quick back-of-the-napkin derivation, and I think the "deriv" should be dE/dy (not dE/ds):
code should probably be:
function value(l::CrossentropyLoss, y::Number, t::Number)
    if t == 0
        -log(1 - y)
    elseif t == 1
        -log(y)
    else
        -(t * log(y) + (1-t) * log(1-y))
    end
end
deriv(l::CrossentropyLoss, y::Number, t::Number) = (1-t) / (1-y) - t / y
Note I switched y/t in the value function (I'm assuming y is the estimated probability and t is the target value)
Note that these calculations only make sense for y in (0,1)... if y is exactly 0 or 1 then we start to get infinities/NaNs. Do we want to check for this somehow?
Note that most computations actually care about the sensitivity of the error to the input to a sigmoid function. So if y = sigmoid(s), we actually care about delta = dE / ds = (dE / dy) * (dy / ds). When you work out the math of the derivative of the sigmoid times the derivative of the cross-entropy function, you end up with the simple: delta = y - t, which is what you see everywhere. The problem is that for generic libraries, the two derivatives need to be computed in different abstractions (loss vs activation).
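The identity delta = y - t can be verified numerically by multiplying the two derivatives, each computed in its own "abstraction". The function names below are hypothetical and only serve this check.

```julia
# Activation and its derivative expressed through the output y = sigmoid(s).
sigmoid(s) = 1 / (1 + exp(-s))
dy_ds(y) = y * (1 - y)

# Cross-entropy loss and its derivative with respect to y.
crossentropy(y, t) = -(t * log(y) + (1 - t) * log(1 - y))
dE_dy(y, t) = (1 - t) / (1 - y) - t / y

s, t = 0.3, 1.0
y = sigmoid(s)

# Chain rule: dE/ds = (dE/dy) * (dy/ds)
delta = dE_dy(y, t) * dy_ds(y)

# Independent check via a central finite difference in s.
h = 1e-6
fd = (crossentropy(sigmoid(s + h), t) - crossentropy(sigmoid(s - h), t)) / (2h)
```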
I think it would be nice to add something along the lines of the following:
l1 = LogitMarginLoss()
l2 = HingeLoss()
l3 = PoissonLoss()
l4 = L2DistLoss()
domain(l1) # returns -1:2:1
domain(l2) # or maybe... BinaryVariable (a type we define)
domain(l3) # all non-negative integers... how to represent?
domain(l4) # returns Real
"domain" might be the wrong name or at least misleading because the losses are defined over all real numbers... But I want to access the possible values of the data.
As a last step before registering
The margin formulation of the least squares loss. Pretty much the L2HingeLoss but not truncated at agreement >= 1. Not the most interesting loss, but for completeness' sake it would be nice to have it in here. (see p. 35, example 2.26 in Steinwart 2008)
The SVM book mentions a margin-based loss I have not encountered before, referred to as the sigmoid loss and defined as L(agreement) = 1 - tanh(agreement).
For matrix decompositions, factor analysis, similar models the predicted object is a matrix or higher-order array of data. In this case, the total loss could be calculated
sumvalue(loss::Loss, target::Matrix, output::Matrix)
As mentioned in passing in #27, currently sumvalue requires target::AbstractVector, which forces me to vectorize target before calling the function:
sumvalue(loss, view(target,:), view(output,:))
This is a fine workaround for now.
But my preference would be to have sumvalue accept target and output as AbstractArrays and check that they have the same dimensions (throwing an error if the check fails). Then it iterates over the two arrays and computes the total loss.
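A minimal sketch of the requested behaviour, with a hypothetical L2DistLoss and a free-standing sumvalue that are not the package's methods, could look like this:

```julia
# Hypothetical scalar loss.
struct L2DistLoss end
value(::L2DistLoss, target::Number, output::Number) = abs2(output - target)

# Total loss over arrays of any (matching) shape.
function sumvalue(loss, target::AbstractArray, output::AbstractArray)
    size(target) == size(output) ||
        throw(DimensionMismatch("target and output must have the same size"))
    total = zero(promote_type(eltype(target), eltype(output)))
    # Iterate over both arrays in lockstep; no reshaping or copying needed.
    for (t, o) in zip(target, output)
        total += value(loss, t, o)
    end
    return total
end

target = [1.0 2.0; 3.0 4.0]
output = [1.0 0.0; 0.0 4.0]
total = sumvalue(L2DistLoss(), target, output)
```

The accumulator type here assumes the loss value has the promoted element type of the inputs, which holds for this example but may not for every loss.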
Some issues are upstream, but a few things could be done already:
call overloads
SubArrays using slice and sub
While snooping around at OnlineStatsModels.jl, I saw that Josh had a good use-case for a ScaledLoss that I had not thought about before: a version of the L2DistLoss scaled by 0.5, which is quite common (especially in introductory ML literature, and I would guess in statistics).
As it is now, we offer no way to create a typealias for a scaled version of a loss, just an instance.
In Josh's use-case above it would make sense to be able to write typealias LinearRegression ScaledDistLoss{L2DistLoss,0.5}. We could make this happen by putting the scale factor into the type itself. I don't see a downside to that, other than type instability when writing 0.5 * L2DistLoss(), but this kind of code is probably not found in some inner loop, and one could always write ScaledDistLoss{L2DistLoss,0.5}() instead if it really matters.
Maybe this way I can also drop the distinction between ScaledDistLoss and ScaledMarginLoss, which would be preferable.
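In current Julia syntax (where typealias has been replaced by a const assignment), the alias and the multiplication shorthand could be sketched as follows. The definitions are hypothetical and only illustrate the scale-in-the-type idea.

```julia
# Hypothetical base loss as a function of the residual.
struct L2DistLoss end
value(::L2DistLoss, r::Number) = abs2(r)

# Scale factor K stored in the type rather than in a field.
struct ScaledDistLoss{L,K}
    loss::L
end
ScaledDistLoss{L,K}() where {L,K} = ScaledDistLoss{L,K}(L())
value(s::ScaledDistLoss{L,K}, r::Number) where {L,K} = K * value(s.loss, r)

# Convenience syntax; type-unstable because k becomes part of the return type.
Base.:*(k::Number, l::L2DistLoss) = ScaledDistLoss{L2DistLoss,k}(l)

# The alias that a value-in-field design could not express.
const LinearRegression = ScaledDistLoss{L2DistLoss,0.5}

half_squared = value(LinearRegression(), 3.0)
```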
Continuing the discussion from #8...
I'd like to plan out the long term goals/terminology of the computation pipeline for fitting and using various ML models. Here's my current thinking... let's try to poke holes in its viability using some real-world models that it should be able to handle:
Empirical risk minimization is a (IMO confusing) name for minimizing a function of model loss (we currently call this PredictionLoss... do you like ModelLoss instead?) and parameter loss: L(loss, ploss). I would prefer we migrate away from the ERM name, even if it is a "descriptively correct" identifier. Maybe we could just call it TotalLoss:
immutable TotalLoss{ML <: ModelLoss, PL <: ParameterLoss} <: Loss
    mloss::ML
    ploss::PL
end
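For illustration, here is how such a TotalLoss might be evaluated, written in current Julia syntax (struct instead of immutable). The concrete loss and penalty types, and the choice to combine the two parts by simple addition, are assumptions made for this sketch.

```julia
abstract type Loss end
abstract type ModelLoss <: Loss end
abstract type ParameterLoss <: Loss end

# Hypothetical model loss on a vector of residuals.
struct L2DistLoss <: ModelLoss end
value(::L2DistLoss, residuals) = sum(abs2, residuals)

# Hypothetical parameter loss (ridge-style penalty) on the weights.
struct L2Penalty <: ParameterLoss
    λ::Float64
end
value(p::L2Penalty, w) = p.λ * sum(abs2, w)

struct TotalLoss{ML<:ModelLoss,PL<:ParameterLoss} <: Loss
    mloss::ML
    ploss::PL
end

# Assumed combination rule: model loss plus parameter loss.
value(t::TotalLoss, residuals, w) = value(t.mloss, residuals) + value(t.ploss, w)

total = value(TotalLoss(L2DistLoss(), L2Penalty(0.1)), [1.0, -2.0], [3.0])
```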
To sum up a typical real-world problem... we want to build a 2-class classifier for a dataset. To build a logistic regression:
data --> filter_columns --> filter_rows --> replace_NAs --> add_transformed_vars
--> learnable_affine_transformation --> SigmoidActivation --> LogitDistLoss
target --> LogitDistLoss
Now we want to do the same thing for an SVM... it's pretty similar, but there's probably no need to add transformed vars (cross multiplying xi * xj for example) because that's what the kernel is for:
data --> filter_columns --> filter_rows --> replace_NAs
--> learnable_affine_transformation --> LogitMarginLoss
target --> LogitMarginLoss
An ANN would be similar to the logistic regression, but there would be many affine_transformation --> activation in sequence.
A decision tree may not need the filter/transform steps (or they may be slightly different), but their learnable transformation would be a "tree of filter transformations".
In all of these examples, there are some parts of the pipeline that can be reused. If one was to create many different models, whether as part of an ensemble or something else, I think treating these pipelines as directed graphs could give some great opportunities for optimization/caching and visualization.
Now a random forest would be an aggregation of decision trees, so we should be able to cover that as an EnsembleTransformation, which takes multiple other transformations as input and outputs a single vector. In theory, there's no reason we can't add an ensemble as part of a fitting/backprop algorithm, passing error sensitivity/responsibility to individual models through the aggregation. It would also open up SGD-style training of layers of ensembles (I'm picturing the US Electoral College voting process as an example).
The requirement is that an ensemble shares some basic verbs with "constant transforms" (such as taking a square root) and more complex transforms (like a decision tree). It must be able to go forward to update its internal state and produce an output value, and it must be able to go backward to update gradients/etc, compute error responsibilities, and adjust learnable parameters.
Anything that takes in an input vector(s) and produces an output vector can be considered a Transformation. If we can be smart about how to connect various transformations during the forward and backward passes of a learning algorithm, then it opens up some really nice opportunities in building, optimizing, and visualizing modelling pipelines, as well as potentially making ensemble-type aggregation part of the supervised learning step.
It would be cool to be able to specify some function from https://github.com/JuliaML/PenaltyFunctions.jl to use as a regularization term in a loss function. This will make this setup able to capture most of the common loss functions more easily.
We should also implement the Pinball loss aka Quantile Loss. (see p. 44, example 2.43 in Steinwart 2008)
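For reference, the pinball loss is commonly written as a function of the residual r = y - yhat with a quantile level τ in (0, 1): it charges τ*r for under-prediction (r >= 0) and (τ - 1)*r for over-prediction. A minimal sketch with a hypothetical QuantileLoss type, assuming that residual convention:

```julia
# Hypothetical pinball / quantile loss with quantile level τ in (0, 1).
struct QuantileLoss
    τ::Float64
end

# Value and derivative with respect to the residual r = y - yhat.
value(l::QuantileLoss, r::Number) = r >= 0 ? l.τ * r : (l.τ - 1) * r
# Not differentiable at r == 0; this sketch arbitrarily returns τ there.
deriv(l::QuantileLoss, r::Number) = r >= 0 ? l.τ : l.τ - 1

median_loss = QuantileLoss(0.5)
upper_loss = QuantileLoss(0.9)
```

With τ = 0.5 the loss is half the absolute error; with τ = 0.9 under-prediction costs more than over-prediction of the same size.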
There seems to be a lack of testing for the type stability of the deriv methods, which I thought I had covered. Furthermore, this allowed some of our implementations to be sub-optimal in terms of type inference.
For example this function doesn't look type-stable for agreement::Int:
function deriv{T<:Number}(loss::SmoothedL1HingeLoss, agreement::T)
    if agreement >= 1 - loss.gamma
        agreement >= 1 ? zero(T) : (agreement - one(T)) / loss.gamma
    else
        -one(T)
    end
end
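The instability comes from the branches returning different types when T is an integer: zero(T) and -one(T) are Int, while the division yields a Float64. One possible fix, sketched in current Julia syntax with a hypothetical struct definition and a Float64 gamma field, is to compute a common result type first and use it in every branch.

```julia
# Hypothetical definition; the field type is an assumption.
struct SmoothedL1HingeLoss
    gamma::Float64
end

function deriv(loss::SmoothedL1HingeLoss, agreement::T) where {T<:Number}
    # Result type of the only branch that divides; used for all branches.
    R = typeof(one(T) / loss.gamma)
    if agreement >= 1 - loss.gamma
        agreement >= 1 ? zero(R) : R((agreement - one(T)) / loss.gamma)
    else
        -one(R)
    end
end

loss = SmoothedL1HingeLoss(2.0)
```

Every branch now returns the same type for a given input type, which is something a test suite can assert directly.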
Is this anything to be worried about?
julia> module X
       using Losses
       using Base.LinAlg
       # errors because ambiguous
       issymmetric(L2DistLoss())
       end
WARNING: both LinAlg and Losses export "issymmetric"; uses of it in module X must be qualified
ERROR: UndefVarError: issymmetric not defined
This will be a pain in the ***, but I think worthwhile. The major problem with the current testing is that, on an error, the tests are halted, and so you can't see the results of all the tests. The output is also a little nicer, and things like your msg/msg2 can be handled nicely by creating nested contexts.
I'll take this on, if you approve of the switch.