juliaml / lossfunctions.jl

Julia package of loss functions for machine learning.

Home Page: https://juliaml.github.io/LossFunctions.jl/stable

License: Other

Julia 100.00%

classification julia loss-functions machine-learning regression

lossfunctions.jl's Issues

Core verbs

What are they, and what functionality and traits do they imply?

Here's a list to start discussion:

fit:
- Take in model, x data, y data (for supervised), fitting params
- Set model parameters appropriate for the input data
fit!:
- Take in model, x data, y data (for supervised), fitting params
- Update model parameters appropriate for the input data
transform/predict:
- Take in model, x data
- Return model predictions or transformations (for a pca model, return the reduced dimensions. for a regression, the predictions)
evaluate
- Take in model, x data, y data
- transform x and apply loss function vs y
- return loss

Others?

Notes:

If we can keep to only fit/transform, then pipelining will be more straightforward.
We can always map internally, so providing const predict = transform as a convenience is ok in my eyes
Classification is just a discrete prediction, and I think should have the same verbs as a linear regression
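To make the fit/transform idea concrete, here is a minimal sketch of what the verbs could look like for a simple centering/scaling transformer. The `CenterScale` struct, its fields, and the `fit(::Type, X)` signature are assumptions made up for this illustration; nothing here is an agreed-upon interface.

```julia
using Statistics  # mean, std

# Hypothetical transformer, used only for illustration.
struct CenterScale
    μ::Vector{Float64}   # per-column means
    σ::Vector{Float64}   # per-column standard deviations
end

# fit: take x data (rows = observations) and return a fitted model
fit(::Type{CenterScale}, X::AbstractMatrix) =
    CenterScale(vec(mean(X, dims=1)), vec(std(X, dims=1)))

# transform: take a fitted model and x data, return the transformed data
transform(cs::CenterScale, X::AbstractMatrix) = (X .- cs.μ') ./ cs.σ'

# predict as a plain alias for transform, as suggested in the notes
const predict = transform

X = [1.0 10.0; 2.0 20.0; 3.0 30.0]
cs = fit(CenterScale, X)
Z = predict(cs, X)   # each column now has mean 0 and standard deviation 1
```

With only these two verbs, a pipeline step is just "fit on the output of the previous transform", which is the property the notes are after.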

I like that you've kicked this off with real code, but I hope that you're still open to re-thinking everything from the ground up. This may involve changing or replacing code that you already have, but of course there will be a good reason for any changes.

With that said, I think the first important step is to define the scope of the problem as best we can:

What problems/algorithms will we support/interface? Deep neural nets? Convolutions? Random Forests? Data preprocessing? Data cleaning? Model analysis and plotting? We want to make this list as complete as possible so that it's easier to see where these components fit into the next items.
What verbs do we need? Which of those can extend StatsBase, and which need to be new? What exactly does the verb mean? What assumptions can we make when a verb is implemented for a particular model?
What are some examples of how a user would interact with these algorithms/models? How do they relate to each other?

Like I said in the roadmap discussion, I think defining a type hierarchy at the beginning is just asking for failure. Some things will require types, sure, but I do feel like premature-typing tends to ruin some of Julia's strengths. This is not an object oriented language, and building an object oriented framework will limit the power of LearnBase.

Let's compile a full scope of the problem first, and then we can start to discuss design specifics after.

Dataset Abstractions

I'm not sure if this belongs here, or maybe MLDataUtils.jl? I want to develop an abstraction surrounding datasets, likely similar to: https://github.com/tbreloff/OnlineAI.jl/blob/dev/src/nnet/data.jl#L136

The concept is that there are many types of data sets, and many ways to access that data. It would be nice if the models/solvers did not need to know about any of this, and we could reduce much of the interface to simple iteration.

Desired API:

# given input/target data in some format, get n random samples from the dataset
for (input, target) in RandomSampler(data, n)
    output = fit!(model, input, target)
    # do something with output?
end

# k-fold cross validation.. partition the dataset (without data copies!)
for (train, validate) in CrossValidation(data, k)

    for (input, target) in RandomSampler(train, n)
        # fit, etc
    end

    # access each pair once, sequentially
    for (input, target) in DataIterator(validate)
        # do something with the validation set
    end
end

I think you can get the idea. This would be a nice pattern to have available, partly because it could handle many different types of incoming data, and partly because it cleanly separates data from algorithms.
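As a rough sketch of how such a sampler could plug into plain `for` loops, the following implements Julia's iteration protocol for a hypothetical `RandomSampler`. It assumes inputs are stored one observation per column and takes inputs and targets as separate arguments (the desired API above passes a single `data` object instead); both choices are assumptions for the example.

```julia
# Hypothetical sampler: yields n random (input, target) pairs.
struct RandomSampler{TX<:AbstractMatrix,TY<:AbstractVector}
    inputs::TX     # one observation per column (assumed layout)
    targets::TY
    n::Int         # number of samples to draw
end

Base.length(s::RandomSampler) = s.n

# Iteration protocol: return (item, state), or nothing when finished
function Base.iterate(s::RandomSampler, count::Int = 0)
    count >= s.n && return nothing
    i = rand(1:length(s.targets))
    # view avoids copying the observation
    return (view(s.inputs, :, i), s.targets[i]), count + 1
end

X = rand(3, 100)
y = rand(100)
seen = 0
for (input, target) in RandomSampler(X, y, 10)
    global seen += 1   # a real loop would call fit!(model, input, target) here
end
```

A model or solver consuming this only needs to know how to iterate; it never sees how the data is stored.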

Thoughts?

Derivatives/residuals

I was adding a QuantileLoss type when I noticed this. In statistics, residual = y - yhat. Our distance losses are implemented as if residual = yhat - y, so derivatives are missing a minus sign.

# L(y, yhat) = abs(y - yhat)
value(loss::L1DistLoss, residual) = abs(residual)
deriv(loss::L1DistLoss, residual) = sign(residual)  # missing a minus sign

There are several places in the code with h(x) - y (plot recipes, docs). Are there reasons for this, or can we switch to statistical notation?
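A small numeric check of the sign issue, assuming the statistical convention residual = y - yhat. The function names are made up for the example. By the chain rule, the derivative of abs(y - yhat) with respect to yhat picks up a factor of -1:

```julia
# Assumed convention for this sketch: residual r = y - yhat
l1(y, yhat) = abs(y - yhat)

# Chain rule: dL/dyhat = sign(r) * dr/dyhat = -sign(y - yhat)
dl1_dyhat(y, yhat) = -sign(y - yhat)

# Central finite difference as an independent check
fd(f, x; h = 1e-6) = (f(x + h) - f(x - h)) / (2h)

y, yhat = 3.0, 1.0
analytic = dl1_dyhat(y, yhat)
numeric  = fd(v -> l1(y, v), yhat)
```

Both values come out negative here: increasing yhat towards y reduces the loss.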

Move this to org

@tbreloff I started by using a few things that I have already implemented. Why don't you establish the org and we move this there and we iterate on it.

Near plans:

~~Flesh out data sources, especially for streaming data~~
~~Flesh out function definition and class hierarchy~~
~~I think it would be cool to build on ROCAnalysis.jl and provide plots for it using Plots.jl~~

EDIT: plans are discussed and iterated on in other issues

Delegation to other packages

Just had a nice lunch with @tbreloff @joshday and @jmxpearson where we discussed having a stand-alone package for online optimization (stochastic gradient descent, ADAM, etc.). I hadn't realized so much had been implemented already in LearnBase. I'm opening this mostly to start a discussion where I can get caught up to speed before I go off and spend time re-implementing the work done here. This issue of course dovetails off previous conversations, but I felt it more appropriate to start a new thread.

First off, great work -- I really like the motivation for this package and it is beautifully organized. But I'm a bit confused as to what the scope of this project is and how I can contribute and work synergistically. Does this package just specify an interface or set of functions that researchers will import and extend in their own packages? Or will the actual code for specifying and optimizing various models all live here?

Specific comments/questions:

How can the scope of this project be explained more clearly to new users? What is the 15 second explanation? (Also see #2)
I am a bit worried that functionality will be buried and hidden from users since this package aims to be very general. I have been interested in a simple SGD implementation, but didn't realize it was available here.
In general, under what conditions will LearnBase delegate responsibilities to other packages? In my view, an appropriate amount of delegation would make it easier to maintain and extend packages as new bugs/issues arise and as research advances and techniques change.
I am happy to push forward on making some optimization packages that play well with the models/interface here. How should I go about doing this?

Adding scale parameter to losses

Suppose I'm working on a dataset with n observations of p measured features. For simplicity let's say all measurements are real-valued and I want to use L2DistLoss. However, each feature has a different variance, which could happen for a variety of reasons.

As a result, I want to scale the loss of each feature so that errors in the noisier features are penalized less. For example, if p = 2 I might calculate the full objective function like this:

for n = 1:nobs
    # penalty for first feature with scale 1.0
    objective += 1.0*value(L2DistLoss(), actual[n,1] - prediction[n,1])
    # penalty for second feature with scale 2.0
    objective += 2.0*value(L2DistLoss(), actual[n,2]  - prediction[n,2])
end

My question is whether it would be useful to define a new type called something like ScaledLoss so that the calculation becomes:

for n = 1:nobs
    # penalty for first feature with scale 1.0
    objective += value(ScaledLoss{L2DistLoss,1.0}(), actual[n,1] - prediction[n,1])
    # penalty for second feature with scale 2.0
    objective += value(ScaledLoss{L2DistLoss,2.0}(), actual[n,2]  - prediction[n,2])
end

The advantage being that this scaling can be passed through when calculating gradients, etc.

Note: these loops are really just for demonstration. I wouldn't actually calculate the objective this way, but I think it illustrates the point.
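For discussion, here is one way such a decorator could be sketched, with the scale stored as a type parameter to match the `ScaledLoss{L2DistLoss,2.0}` spelling above. The abstract type and the `L2DistLoss` stand-in are redefined locally so the snippet runs on its own; this is an assumption-laden sketch, not the package's implementation.

```julia
abstract type DistanceLoss end            # local stand-in for the package's abstract type
struct L2DistLoss <: DistanceLoss end

value(::L2DistLoss, r::Number) = abs2(r)
deriv(::L2DistLoss, r::Number) = 2r

# Hypothetical decorator: the scale K lives in the type
struct ScaledLoss{L<:DistanceLoss,K} <: DistanceLoss
    loss::L
end
ScaledLoss{L,K}() where {L,K} = ScaledLoss{L,K}(L())

# The scale factor passes through to both the value and the derivative
value(s::ScaledLoss{L,K}, r::Number) where {L,K} = K * value(s.loss, r)
deriv(s::ScaledLoss{L,K}, r::Number) where {L,K} = K * deriv(s.loss, r)

actual     = [1.0 2.0; 3.0 4.0]
prediction = [0.5 1.0; 2.0 6.0]
losses = (ScaledLoss{L2DistLoss,1.0}(), ScaledLoss{L2DistLoss,2.0}())

# One scaled loss per feature (column), summed over all observations
objective = sum(value(losses[j], actual[n, j] - prediction[n, j])
                for n in axes(actual, 1), j in axes(actual, 2))
```

Because `deriv` is defined on the decorator as well, a gradient-based solver would see the per-feature scaling without any extra bookkeeping.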

Improve documentation

We have a rudimentary documentation in place, but it could be a lot better.

More description text for each loss
Formula for derivative for each loss (?)

Register in METADATA

So... I think LearnBase is ready to be registered. It's not complete, but what package is? It's got a ton of content, and I'm using it as the basis for a couple packages (OnlineAI is one). What do you think? Should we start a checklist of things to finish before we register?

Loss Functions

This issue continues the discussion started at #4 concerning loss functions.

I thought a lot about our discussions so far and did a lot of targeted reading on the subject. Since the author of EmpiricalRisks seems very busy currently I am starting to agree that we should at least for now do our own Loss functions (we can always merge efforts later). I really want to keep the momentum going.

I suggest we simply go ahead and establish a package that is only concerned with loss functions but does those well. It should perfectly serve all our current use cases at least.

In fact, I made the first steps and outlined quite some code at LossFunctions.jl. The implementation is heavily based on the in-depth treatment of loss functions by Ingo Steinwart et al.

This is what I did so far:

Essentially I'd like to keep the type hierarchy small, but the following abstract tree seems essential to me. Note how this allows dispatching on the loss being a classifier or not (which I need for SVMs).

Cost,
  Loss,
    SupervisedLoss,
      MarginBasedLoss,
      DistanceBasedLoss,
    UnsupervisedLoss,

Furthermore, I have included quite a collection of verbs from the literature that seem useful. Not all Losses have to support all verbs.

    value,
    deriv,
    deriv2,
    value_deriv,

    value_fun,
    deriv_fun,
    deriv2_fun,
    value_deriv_fun,
    representing_fun,
    representing_deriv_fun,

    isminimizable,
    isdifferentiable,
    isconvex,
    isnemitski,
    isunivfishercons, # useful for multivariate
    isfishercons,
    islipschitzcont,
    islocallylipschitzcont,
    isclipable,
    ismarginbased,
    isclasscalibrated,
    isdistancebased,
    issymmetric

Concerning the Losses themselves, I actually want to provide two ways to interact with the most common ones.

Using loss functors, which is the most flexible and general approach

myloss = LogitLoss()
myloss(x) # for margin- and distance- based
myloss'(x) # for margin- and distance- based
value(myloss, ...)
deriv(myloss, ...)
deriv2(myloss, ...)
value_deriv(myloss, ...)

f = value_fun(myloss) # also possible
g = deriv_fun(myloss)

~~Using plain functions for simple use cases (at least for the common losses)~~ EDIT: I removed that for now to avoid confusions

logit_loss(...)
logit_deriv(...)
logit_deriv2(...)
logit_value_deriv(...)

For now I have just implemented LogitLoss, L1Loss and L2Loss as examples. But once we come to some agreement I would happily implement any loss that I can think of.
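To illustrate the functor approach, here is a minimal self-contained sketch for a logistic margin loss. The method bodies, and the use of `Base.adjoint` to support the `myloss'(x)` spelling, are assumptions for this example and may differ from what the package does.

```julia
abstract type MarginLoss end
struct LogitLoss <: MarginLoss end

# value and derivative as functions of the agreement a = y * yhat
value(::LogitLoss, a::Number) = log1p(exp(-a))
deriv(::LogitLoss, a::Number) = -1 / (1 + exp(a))

# Functor syntax on the concrete type: myloss(a)
(l::LogitLoss)(a::Number) = value(l, a)

# myloss'(a): overload adjoint so that ' returns the derivative as a closure
Base.adjoint(l::LogitLoss) = a -> deriv(l, a)

# value_fun / deriv_fun simply return closures over the loss
value_fun(l::MarginLoss) = a -> value(l, a)
deriv_fun(l::MarginLoss) = a -> deriv(l, a)

myloss = LogitLoss()
v = myloss(0.0)     # loss at zero agreement
d = myloss'(0.0)    # derivative at zero agreement
f = value_fun(myloss)
```

The closures returned by `value_fun` and `deriv_fun` are convenient when handing a loss to code that expects plain functions.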

What do you think? Any suggestions or criticism?

WeightedBinaryLoss

A generalization of #48 (and as such a substitute for it) for any margin-based loss. It should be realized in the form of a decorator (similar to what a ScaledLoss is).

let L_a be the weighted version of L with weight parameter a

L_a(y, yhat) = if y==1
     (1 - a)*L(yhat)
elseif y == -1
     a*L(-yhat)
end

EDIT: while this is a decorator for any margin-loss, the resulting Loss itself is not margin-based
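A minimal sketch of the decorator, following the definition above literally. The local `HingeLoss` stand-in and the field names are assumptions for illustration only.

```julia
struct HingeLoss end
value(::HingeLoss, a::Number) = max(zero(a), 1 - a)   # margin-based: function of agreement

# Hypothetical decorator wrapping any margin-based loss
struct WeightedBinaryLoss{L}
    loss::L
    a::Float64      # weight parameter
end

function value(w::WeightedBinaryLoss, y::Number, yhat::Number)
    if y == 1
        (1 - w.a) * value(w.loss, yhat)
    elseif y == -1
        w.a * value(w.loss, -yhat)
    else
        throw(ArgumentError("target must be +1 or -1"))
    end
end

wl = WeightedBinaryLoss(HingeLoss(), 0.25)
pos = value(wl,  1, 0.0)   # agreement y*yhat is 0
neg = value(wl, -1, 0.0)   # agreement y*yhat is also 0
```

Both calls have the same agreement but return different values, which is why the decorated loss cannot be written as a function of the agreement alone, as the EDIT notes.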

Remove UnicodePlots dependency in favour of example plots in docstring

I was thinking. The main reason I had for including UnicodePlots was for the little example plots in show. Well maybe the better place would be in the docstring for each loss function, which are predefined strings as we know. For example the result for ?HingeLoss would include a little precomputed unicodeplot-plot at the bottom showing what the loss function and derivative look like.

Consider tests against ArrayFire.jl

https://github.com/JuliaComputing/ArrayFire.jl

Worth thinking about given that it has a CPU back-end that should work on travis. In the end we would like to run these things on the GPU.

Thoughts?

Wishlist

What functionality should be accessible? How should we access it? What is core? What exists already in a usable state? (see StatsBase.jl, MLBase.jl, MachineLearning.jl, ...) What should be wrapped/linked? What should be left unimplemented, waiting for 3rd party extension?

This is not a complete list... just a placeholder which we should add to.

Concepts:

Data preprocessing
- Convert text/class data
- NaN cleaning
- test/train split, cross validation iteration
- sampling methods, etc
- centering/normalization
Loss functions
- L2, cross entropy, etc
Penalty functions
- L1/L2
- Dropout (ANNs)
Solvers/Fitters
- batch/online
- brute force/gradient based
- evolutionary
Evaluators
- stats (r2, roc, etc)
Analysis
- tables, plotting, etc
Ensembles
Hyperparameter search/optimization
Chaining/pipelines
Parallelization (distributed data, etc)

Models/algorithms:

Clustering/Dimensionality Reduction
- SVM
- KNN
- PCA
- Autoencoders
- RBMs
Classification (decision trees, forests)
Regressions (linear, logistic, lasso, etc)
Bayesian inference
Convolutions (general and image-specific)
ANNs
RNNs
SNN (spiking NN)

ExpLoss

Another margin-based loss function to add to our collection is the exponential loss L(agreement) = exp(-agreement) used in AdaBoost.

Multivariate Loss

One bridge we have to cross sooner or later is multiclass problems. There are multiclass extensions or formulations for a couple of the losses that we have.

A particularly interesting example is the hinge loss. In a multinomial setting the targets could be indices, such as [1,2,3,1], which would violate our current idea of a targetdomain being [-1,1].

One solution could be to think of the multivariate version as separate from the binary one. For example we could lift it like this: Multivariate{L1HingeLoss}. This could then have its own targetdomain and other properties. It would also avoid potential ambiguities when it comes to dispatching on the types of targets and outputs. I am not sure we could be certain of dealing with a multivariate vs binary case just based on the parameter types.

WeightedZeroOneLoss

Again for completeness' sake. As the name suggests, this is a weighted version of the classification loss (zero-one loss) which penalizes the errors differently according to a hyperparameter. (see p. 24, example 2.5 in Steinwart 2008)

Steinwart, Ingo, and Andreas Christmann. Support vector machines. Springer Science & Business Media, 2008.

Rename MarginLoss to AgreementLoss?

Similar to how value, deriv, etc. are a function of the distance (residual) for DistanceLoss, how about changing MarginLoss to AgreementLoss, as those methods are a function of the agreement?

Allow member variable of Losses to be Float32

We currently have somewhat of an inconsistency for what type member variables of Losses are.

Some are flexible

but some are hardcoded to Float64 for little reason. We should make them flexible as well

SmoothedL1HingeLoss
DWDMarginLoss

0.5 call overloading

It's not critical right now, but good to keep an eye on. Call overloading for abstract types has apparently gone away, which is causing travis failures:

WARNING: deprecated syntax "call(c::PredictionLoss, ...)".
Use "(c::PredictionLoss)(...)" instead.
ERROR: LoadError: LoadError: LoadError: LoadError: cannot add methods to an abstract type
 in include(Core.#include, UTF8String) at /Users/travis/julia/lib/julia/sys.dylib:-1
 in include_from_node1(Base.#include_from_node1, ASCIIString) at /Users/travis/julia/lib/julia/sys.dylib:-1
 in include(Core.#include, UTF8String) at /Users/travis/julia/lib/julia/sys.dylib:-1
 in include_from_node1(Base.#include_from_node1, ASCIIString) at /Users/travis/julia/lib/julia/sys.dylib:-1
 in include(Core.#include, ASCIIString) at /Users/travis/julia/lib/julia/sys.dylib:-1
 in include_from_node1(Base.#include_from_node1, ASCIIString) at /Users/travis/julia/lib/julia/sys.dylib:-1
 in eval(Core.#eval, Module, Any) at /Users/travis/julia/lib/julia/sys.dylib:-1
 in require(Base.#require, Symbol) at /Users/travis/julia/lib/julia/sys.dylib:-1
 in include(Core.#include, ASCIIString) at /Users/travis/julia/lib/julia/sys.dylib:-1
 in include_from_node1(Base.#include_from_node1, UTF8String) at /Users/travis/julia/lib/julia/sys.dylib:-1
 in process_options(Base.#process_options, Base.JLOptions) at /Users/travis/julia/lib/julia/sys.dylib:-1
 in _start(Base.#_start) at /Users/travis/julia/lib/julia/sys.dylib:-1
while loading /Users/travis/.julia/v0.5/LearnBase/src/loss/abstract.jl, in expression starting on line 5
while loading /Users/travis/.julia/v0.5/LearnBase/src/loss/loss.jl, in expression starting on line 2
while loading /Users/travis/.julia/v0.5/LearnBase/src/LearnBase.jl, in expression starting on line 64
while loading /Users/travis/.julia/v0.5/LearnBase/test/runtests.jl, in expression starting on line 1

The relevant issue: JuliaLang/julia#14919

Overlap

I'm starting this as a catch-all for discussion about concepts that heavily overlap between disciplines. I had started to contemplate how to merge the implementation that you're developing here with the concepts/requirements that I see in OnlineStats/OnlineAI.

(sorry... this is a little rambling...)

Similar:

Link functions in GLM (https://en.wikipedia.org/wiki/Generalized_linear_model#Link_function)
Activation functions in ANNs (https://en.wikipedia.org/wiki/Activation_function)
Kernels (could be considered similar with the right perspective)

A link function is effectively of the form: f(y) = x'w + b, or equivalently y = f^-1(x'w + b), where f^-1 is the inverse of the link
An activation function is effectively: y = f(x'w + b)... the name is more a holdover from the perceptron's discontinuous {0,1} classification, but over time has seemed to blend more with the concept of a link function. So we see a "sigmoid activation" is effectively the same as the "inverse logit link" in the GLM sense. It's valuable to think of a neural net with no hidden layers to be conceptually similar to a GLM... we are transforming the inner product of inputs and weights (plus bias), with the goal of maximizing the likelihood that w/b generate the relationship.

Now, neural nets are built from connected layers of these GLMs, which just happen to have solvers that exploit the known structure to provide better "measures of fit" for the earlier layers. The likelihood is in relation to the whole of the structure, as opposed to just a single function of f(x'w + b).

Let's jump ahead and think of how we can apply a "layering" concept to other areas. You mentioned in another issue that you would like to do something like:

cs = fit(CenterScale(), Xtrain)
Xtrain2 = predict(cs, Xtrain)

pca = fit(PCA(), Xtrain2)
Xtrain3 = predict(pca, Xtrain2)

myfit = fit(MyModel(), Xtrain3)
yhat = predict(myfit, Xtrain3)

In some sense this is an extremely specific 3-layer neural net, and I think some abstractions can be seen. Assume we have 3 input variables, and 1 target. Imagine instead:

the first layer is y1 = f1(x'w1 + b1), with w1 the identity matrix, and f1 the identity function.
the second layer is y2 = f2(y1'w2 + b2), with w2 as the principal components loading matrix, f2 the identity function
the third layer is the model y3 = f3(y2, w3, b3)

Thinking about things like this I think will open your mind to some alternative ways to think about modeling pipelines... specifically why are you focused on de-meaning the inputs, or reducing dimensions using PCA, if your final goal is really to have some larger function which maps initial inputs to final target values. The concept behind PLS is very similar... instead of 2 independent steps of "reduce dimension, then fit", it combines into a unified step "reduce dimensions and fit". This is how neural nets work as well, and backprop just gives a method to fit the weights of earlier models using the chain rule and intermediate gradients.

There are also parallels to a lot of the deep learning approaches... namely the use of Restricted Boltzmann Machines and Stacked Autoencoders (both have heavy parallels with PCA and dimensionality reduction) as inputs into a simpler predictive model. There has been good success in using these models/algorithms simply as weight initialization in a larger system, where those models are effectively re-fit using the final target as driving the weight updates, here with a heavy parallel to PLS.

The point I'm trying to get across with all these tangents is that I think it can be very helpful when designing, to think about how simpler models (pca, svm, glm, ...) could be reinterpreted in terms of a more complicated framework. With the right abstractions, we could recreate something like PLS automatically by plugging the generalized PCA with generalized logistic regression.
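A rough sketch of the "preprocessing steps as layers" view, under several assumptions: the `Layer` struct is hypothetical, observations are stored as columns, only centering (not scaling) is modelled in the first layer, and the principal directions are taken from an SVD of the centered data.

```julia
using LinearAlgebra, Statistics

# Hypothetical layer: y = f.(W * x .+ b)
struct Layer{F}
    W::Matrix{Float64}
    b::Vector{Float64}
    f::F
end
(l::Layer)(x::AbstractVector) = l.f.(l.W * x .+ l.b)

# 3 input variables, 4 observations (one per column)
X = [1.0 2.0 3.0 4.0;
     2.0 4.1 5.9 8.0;
     0.5 0.4 0.6 0.5]
μ = vec(mean(X, dims=2))

# Layer 1: centering, i.e. identity weights, bias -μ, identity activation
center = Layer(Matrix{Float64}(I, 3, 3), -μ, identity)

# Layer 2: projection onto the top 2 principal directions
U = svd(X .- μ).U
project = Layer(Matrix(U[:, 1:2]'), zeros(2), identity)

# Composing the layers gives one function from raw inputs to reduced features
pipeline = project ∘ center
z = pipeline(X[:, 1])
```

Once both steps share the same affine-plus-activation shape, a final model layer could be appended and all weights re-fit jointly, which is the PLS-like idea described above.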

ready for prime time?

@Evizero I'm working on a new recurrent neural net framework within OnlineAI, and I'm considering whether I can switch the loss models to those in LearnBase. I see you haven't pushed any updates in a while... what is your latest thinking on this repo? There are a few minor aspects that I would consider changing (remove UnicodePlots dependency, add "Abstract" to some of the type names, etc), as well as some conventions (multiplying by 0.5 for L2 loss, for example)... but I expect most of this is good as-is. What's your opinion on the current state of the package, and would you be happy for me to submit PRs to change some of this stuff to my liking?

Update readme to reflect current functionality

Right now the README is in a very old and not very useful state.

It should include descriptions of:

Loss Functions
Recipes preview
Property list?

Add min and argmin to losses

At the moment the minimum of the PoissonLoss is non-zero. This does not change the answer to whatever optimization problem you are solving, so I am inclined to leave it as is for now.

If we decided we want the convention to be zero loss for perfect fit, we'd have to make the following change:

# current:
value(::PoissonLoss, t::Number, o::Number) = exp(o) - t*o

# add a shift:
value(::PoissonLoss, t::Number, o::Number) = exp(o) - t*o + t*log(t) - t

This convention may have some advantages, but I'd have to think carefully about it. A downside is that it adds potentially unnecessary computations.

Corollary: If we don't do the re-shifting, it might be nice to add some functions:

minimum(::PoissonLoss, t::Number) = t - t*log(t)   # best achievable value
argmin(::PoissonLoss, t::Number) = log(t)          # value of output to achieve min
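A quick numeric sanity check of the shift, written with plain functions rather than the package types (the function names are made up for this sketch). Setting the derivative exp(o) - t to zero gives o = log(t), and the shifted loss should evaluate to zero there. The check assumes t > 0; for t = 0 the term t*log(t) would need special handling.

```julia
poisson(t, o)         = exp(o) - t * o                    # current definition
poisson_shifted(t, o) = exp(o) - t * o + t * log(t) - t   # with the proposed shift
dpoisson(t, o)        = exp(o) - t                        # derivative w.r.t. the output o

t = 4.0
o_star = log(t)                                # stationary point of the loss
slope_at_min   = dpoisson(t, o_star)           # should be (numerically) zero
shifted_at_min = poisson_shifted(t, o_star)    # should be (numerically) zero
```

Moving `o` away from `o_star` in either direction gives a strictly positive shifted loss, consistent with "zero loss for a perfect fit".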

Repo needs a find/replace for stricktly -> strictly

I noticed the misspelling in some doc strings.

Add WeightedLosses

Weighted losses seem to be more general than the loss functions considered here. I personally want this so that I can also specify time-weighted regressions like the two-stage method here using the same infrastructure as LossFunctions.jl. Other places where this comes up are things like LOESS. It would be a very small expansion to the API to allow passing a weight vector (optional, defaulting to ones) and weight function, but it seems like it could have many uses.

Handling binary sparse vectors and arrays

I think there are a lot of issues for handling sparse data. The main one I want to highlight is that for sparse binary data we may want to rethink the {+1,-1} encoding convention. Consider applications where you are trying to detect sparse, binary events. Then I think it makes sense for the target to be a sparse vector and for us to interpret the zeros as -1s.

using LossFunctions
t = [-1,1.]
t_sparse = sparse([0,1.])
x = randn(2);
value(LogitMarginLoss(),t,x)
2-element Array{Float64,1}:
 0.0806863
 0.0330961

value(LogitMarginLoss(),t_sparse,x)
2-element Array{Float64,1}:
 0.0454645
 0.0330961

There are a variety of efficiencies we could implement in this case as well.

Documentation of Loss properties

It would be nice to describe the implemented properties. For example, the description for isconvex could state a short formal or informal definition of what makes a loss convex, and so on.

note to self: link to places. add reference in contributing issue

Remove conditional dependencies

Right now there is still this placeholder code I was working on: https://github.com/Evizero/LearnBase.jl/blob/master/src/optim/optim.jl#L6-L26. I think the current best way would be to utilize Optim.jl ~~using requires~~

ZeroOneLoss; Margin-based or not

I remember the question being raised why the ZeroOneLoss is not a subtype of MarginLoss (although I can't find the associated issue).

I investigated a bit to retrace my steps. The book I have been using as a guide states the classification Loss is not margin-based, yet our implementation seems to fit the definition. I found out that the reason for this is that I simply did not implement the loss as it is described in the book (which is L(y,t) = 1 (y sign t), where the 1 is an indicator function). The book version clearly isn't margin based, since it can't be written as L(y*t).

At some point I must have been compelled to try and formulate the Loss as a margin-based one, but then forgot to reflect that in the type tree.

tl;dr: Our implementation is margin-based. will fix.

Undefined exports

I just noticed grad_fun and value_grad_fun are exported but not defined anywhere.

Add proximal mappings

I like the idea of having a standard bank of loss functions and parameter penalties. One thing that would be very useful to compute for all instances of this would be proximal mappings. These form the basis for a large class of optimization algorithms for non-smooth functions (authoritative review here: http://stanford.edu/~boyd/papers/prox_algs.html).

The definition of the proximal operator/mapping is:

prox(f,x0,rho) = argmin_x f(x) + (0.5/rho)*norm(x-x0)^2

Where f is the function (typically a penalty or loss), x0 is the current parameter guess, and rho tunes the step size. If rho is small (near zero) and f is smooth, then the parameter update (x - x0) is approximately in the direction of the negative gradient -- i.e. the proximal mapping performs gradient descent (albeit with small step sizes). See the review above for more intuitions.

To start, I would propose two new functions prox and prox! for loss and penalty functions. For the L1 penalty:

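function prox(r::L1Penalty, x0::Array, ρ::Float64)
    # soft-thresholding, broadcast elementwise over the array
    return max.(0, x0 .- ρ) .- max.(0, -x0 .- ρ)
end

function prox!(r::L1Penalty, x0::Array, ρ::Float64)
    # in-place variant: overwrite x0 with its soft-thresholded values
    for i in eachindex(x0)
        x0[i] = max(0, x0[i] - ρ) - max(0, -x0[i] - ρ)
    end
    return x0
end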

I have some rough code here, which I am happy to port over: https://github.com/ahwillia/ProxAlgs.jl

I also have some optimization routines implemented in that package, like ADMM. I could port those over as well, but as I brought up in #22 -- I'm still a bit conflicted over whether we should be fleshing out full optimization routines in this package.

Corollary 1: The grad and grad! functions are a bit tough for me to parse by just perusing over the source code. Is there a reason for these to be so closely tied to risk models? I think it makes sense for prox and prox! to mirror how we calculate gradients.

Corollary 2: Have we thought about how to represent objective functions with multiple penalties? For example, I have to implement different prox operators for L1, L2, and ElasticNet, even though elastic net is just a linear combination of L1 and L2 penalties. In other words, prox(f+g,x0,r) ≠ prox(f,x0,r) + prox(g,x0,r) in general.
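To illustrate the non-additivity with scalars, here is a small sketch using the prox definition given above. It takes f(x) = |x| and g(x) = x^2/2 as example penalties; the function names and the closed forms are worked out for this example and are not taken from any package.

```julia
# Prox of f(x) = |x| with step ρ: soft-thresholding
prox_l1(x0::Real, ρ::Real) = max(0, x0 - ρ) - max(0, -x0 - ρ)

# Prox of g(x) = x^2 / 2 with step ρ: minimizer of x^2/2 + (0.5/ρ)*(x - x0)^2
prox_l2(x0::Real, ρ::Real) = x0 / (1 + ρ)

# Prox of the sum f + g (an elastic-net-style penalty): shrink, then scale
prox_sum(x0::Real, ρ::Real) = prox_l1(x0, ρ) / (1 + ρ)

x0, ρ = 3.0, 1.0
separate = prox_l1(x0, ρ) + prox_l2(x0, ρ)   # adding the two individual prox results
joint    = prox_sum(x0, ρ)                   # prox of the combined penalty
```

Here `separate` and `joint` differ, so a combined penalty needs its own prox (or a splitting scheme such as ADMM) rather than a sum of the individual ones.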

Distance Weighted Discrimination MarginLoss

DWD is a differentiable generalization of L1HingeLoss (but different from SmoothedL1HingeLoss). Is DWDMarginLoss a good enough name?

See page 11. https://arxiv.org/pdf/1508.05913.pdf

Screenshot storage

I'll abuse this issue to store screenshots used in documentation and readme. Not the cleanest solution there ever was or will be, but certainly convenient 🐢

Rename isstronglyconvex to isstrictlyconvex

I just realized that I named the method wrong in the beginning. There are subtle differences between being strictly convex and strongly convex, and right now I marked all the losses as strongly convex when I actually should have marked them as strictly convex.

I don't think the logistic margin loss is strongly convex for example, while it is strictly convex.

RFC: cross entropy changes

I think this is wrong. I just did a quick back-of-the-napkin derivation, and I think the "deriv" should be dE/dy (not dE/ds):

crossentropy_deriv.pdf

code should probably be:

function value(l::CrossentropyLoss, y::Number, t::Number)
    if t == 0
        -log(1 - y)
    elseif t == 1
        -log(y)
    else
        -(t * log(y) + (1-t) * log(1-y))
    end
end

deriv(l::CrossentropyLoss, y::Number, t::Number) = (1-t) / (1-y) - t / y

Note I switched y/t in the value function (I'm assuming y is the estimated probability and t is the target value)

Note that these calculations only make sense for y in (0,1)... if y is exactly 0 or 1 then we start to get infinities/NaNs. Do we want to check for this somehow?

Note that most computations actually care about the sensitivity of the error to the input to a sigmoid function. So if y = sigmoid(s), we actually care about delta = dE / ds = (dE / dy) * (dy / ds). When you work out the math of the derivative of the sigmoid times the derivative of the cross-entropy function, you end up with the simple: delta = y - t, which is what you see everywhere. The problem is that for generic libraries, the two derivatives need to be computed in different abstractions (loss vs activation).
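The chain-rule claim in the last note can be checked numerically. The sketch below uses plain functions with made-up names and assumes y = sigmoid(s) lies strictly inside (0,1):

```julia
sigmoid(s)  = 1 / (1 + exp(-s))
dsigmoid(s) = sigmoid(s) * (1 - sigmoid(s))              # dy/ds

crossentropy(y, t)  = -(t * log(y) + (1 - t) * log(1 - y))
dcrossentropy(y, t) = (1 - t) / (1 - y) - t / y          # dE/dy

s, t = 0.3, 1.0
y = sigmoid(s)
delta = dcrossentropy(y, t) * dsigmoid(s)                # dE/ds via the chain rule
```

`delta` agrees with `y - t` up to floating-point error, even though the two factors were computed in separate "loss" and "activation" functions.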

Add ability to query domain of a loss function

I think it would be nice to add something along the lines of the following:

l1 = LogitMarginLoss()
l2 = HingeLoss()
l3 = PoissonLoss()
l4 = L2DistLoss()

domain(l1)   # returns -1:2:1
domain(l2)   # or maybe... BinaryVariable (a type we define)

domain(l3)   # all non-negative integers... how to represent?

domain(l4)   # returns Real

"domain" might be the wrong name or at least misleading because the losses are defined over all real numbers... But I want to access the possible values of the data.

Move Package to JuliaML

As a last step before registering

L2MarginLoss

The margin formulation of the least squares loss. Pretty much the L2HingeLoss but not truncated at agreement >= 1. Not the most interesting loss, but for completeness' sake it would be nice to have it in here. (see p. 35, example 2.26 in Steinwart 2008)

Steinwart, Ingo, and Andreas Christmann. Support vector machines. Springer Science & Business Media, 2008.

SigmoidLoss

I found a margin-based loss mentioned in the SVM book I have not encountered before referred to as the sigmoid loss defined as L(agreement) = 1 - tanh(agreement).

Have sumvalue accept higher-order arrays

For matrix decompositions, factor analysis, and similar models, the predicted object is a matrix or higher-order array of data. In this case, the total loss could be calculated as:

sumvalue(loss::Loss, target::Matrix, output::Matrix)

As mentioned in passing in #27, currently sumvalue requires target::AbstractVector which forces me to vectorize target before calling the function:

sumvalue(loss, view(target,:), view(output,:))

This is a fine workaround for now.

But my preference would be to have sumvalue accept target and output as AbstractArrays and check that they have the same dimensions (throwing an error if the check fails). Then it iterates over the two arrays and computes the total loss.
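A minimal sketch of what that generalized method might look like. The local `L2DistLoss` stand-in and its `value` method are assumptions so the snippet runs on its own; this is not the package's actual implementation.

```julia
struct L2DistLoss end
value(::L2DistLoss, target::Number, output::Number) = abs2(output - target)

# Hypothetical generalization: accept arrays of any dimension
function sumvalue(loss, target::AbstractArray, output::AbstractArray)
    # check the dimensions first and fail loudly on a mismatch
    size(target) == size(output) ||
        throw(DimensionMismatch("target and output must have the same size"))
    total = zero(promote_type(eltype(target), eltype(output)))
    # iterate over both arrays in lockstep, no reshaping or copying needed
    for (t, o) in zip(target, output)
        total += value(loss, t, o)
    end
    return total
end

target = [1.0 2.0; 3.0 4.0]
output = zeros(2, 2)
total = sumvalue(L2DistLoss(), target, output)
```

The same method handles vectors, matrices, and higher-order arrays, so the `view(target,:)` workaround would no longer be needed.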

Update to work with 0.5 changes

Some issues are upstream, but a few things could be done already

remove call overloads
Get rid of array views in favour of native SubArrays using slice and sub

ScaledLoss - Promote scale factor to type parameter

While snooping around at OnlineStatsModels.jl, I saw that Josh had a good use-case for a ScaledLoss that I had not thought about before here. Basically a 0.5 times scaled version of the L2DistLoss, that is quite common (especially in introduction ML literature, and I would guess in Statistics).

As it is now, we offer no way to create a typealias for a scaled version of a loss, just an instance.
In Josh's use-case above it would make sense to be able to write typealias LinearRegression ScaledDistLoss{L2DistLoss,0.5}. We could make this happen by putting the scale factor into the type itself. I don't see a downside to that, other than type instability when writing 0.5 * L2DistLoss(), but this kind of code is probably not found in some inner loop, and one could always write ScaledDistLoss{L2DistLoss,0.5}() instead if it really matters.

Maybe this way I can also drop the distinction between ScaledDistLoss and ScaledMarginLoss, which would be preferable.
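
A sketch of what the proposal could look like, written in current Julia syntax (`const` alias instead of `typealias`). The abstract type, the stand-in `L2DistLoss`, and the `value` method are assumptions made so the example is self-contained; the real package types may differ.

```julia
abstract type DistanceLoss end          # stand-in for the package's abstract type
struct L2DistLoss <: DistanceLoss end   # stand-in for the package's L2DistLoss

value(::L2DistLoss, difference::Real) = abs2(difference)

# The scale factor K lives in the type, so the struct stores only the loss.
struct ScaledDistLoss{L<:DistanceLoss,K} <: DistanceLoss
    loss::L
end

# Zero-argument constructor for losses that can be built without arguments.
ScaledDistLoss{L,K}() where {L<:DistanceLoss,K} = ScaledDistLoss{L,K}(L())

value(s::ScaledDistLoss{L,K}, difference::Real) where {L,K} =
    K * value(s.loss, difference)

# Convenience syntax. This is the type-unstable part mentioned above:
# the result type depends on the runtime value of `k`.
Base.:*(k::Real, loss::DistanceLoss) = ScaledDistLoss{typeof(loss),k}(loss)

# The alias that motivated the proposal.
const LinearRegressionLoss = ScaledDistLoss{L2DistLoss,0.5}
```

With this layout, `LinearRegressionLoss()` and `0.5 * L2DistLoss()` produce objects of the same type, and only the second form pays the type-instability cost.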

pipelines of transformations discussion

Continuing the discussion from #8...

I'd like to plan out the long-term goals and terminology of the computation pipeline for fitting and using various ML models. Here's my current thinking... let's try to poke holes in its viability using some real-world models that it should be able to handle:

Data preprocessing: filtering, projections, NA replacement, column transforms
Regression variants
SVM variants
Random Forest variants
ANN, RNN variants
Ensembles, model aggregation

Empirical risk minimization is a (IMO confusing) name for minimizing a function of model loss (we currently call this PredictionLoss... do you like ModelLoss instead?) and parameter loss: L(loss, ploss). I would prefer we migrate away from the ERM name, even if it is a "descriptively correct" identifier. Maybe we could just call it TotalLoss:

immutable TotalLoss{ML <: ModelLoss, PL <: ParameterLoss} <: Loss
    mloss::ML
    ploss::PL
end

Pipelines of transformations

To sum up a typical real-world problem... we want to build a 2-class classifier for a dataset. To build a logistic regression:

data --> filter_columns --> filter_rows --> replace_NAs --> add_transformed_vars
     --> learnable_affine_transformation --> SigmoidActivation --> LogitDistLoss
target --> LogitDistLoss

Now we want to do the same thing for an SVM... it's pretty similar, but there's probably no need to add transformed vars (cross multiplying xi * xj for example) because that's what the kernel is for:

data --> filter_columns --> filter_rows --> replace_NAs
     --> learnable_affine_transformation --> LogitMarginLoss
target --> LogitMarginLoss

An ANN would be similar to the logistic regression, but there would be many affine_transformation --> activation in sequence.

A decision tree may not need the filter/transform steps (or they may be slightly different), but their learnable transformation would be a "tree of filter transformations".

In all of these examples, there are some parts of the pipeline that can be reused. If one was to create many different models, whether as part of an ensemble or something else, I think treating these pipelines as directed graphs could give some great opportunities for optimization/caching and visualization.

Ensembles

Now a random forest would be an aggregation of decision trees, so we should be able to cover that as an EnsembleTransformation, which takes multiple other transformations as input and outputs a single vector. In theory, there's no reason we can't add an ensemble as part of a fitting/backprop algorithm, passing error sensitivity/responsibility to individual models through the aggregation. It would also open up SGD-style training of layers of ensembles (I'm picturing the US Electoral College voting process as an example).

The requirement is that an ensemble shares some basic verbs with "constant transforms" (such as taking a square root) and more complex transforms (like a decision tree). It must be able to go forward to update its internal state and produce an output value, and it must be able to go backward to update gradients/etc, compute error responsibilities, and adjust learnable parameters.

Summary

Anything that takes in an input vector(s) and produces an output vector can be considered a Transformation. If we can be smart about how to connect various transformations during the forward and backward passes of a learning algorithm, then it opens up some really nice opportunities in building, optimizing, and visualizing modelling pipelines, as well as potentially making ensemble-type aggregation part of the supervised learning step.
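
To make the idea concrete, here is a rough forward-pass-only sketch. Every name in it (`Transformation`, `transform`, `FunctionTransform`, `Affine`, `Chain`, `Ensemble`) is a hypothetical placeholder rather than an agreed design, and the backward pass discussed above is deliberately left out.

```julia
abstract type Transformation end

# A fixed ("constant") transform with no learnable parameters.
struct FunctionTransform{F} <: Transformation
    f::F
end
transform(t::FunctionTransform, x::AbstractVector) = map(t.f, x)

# A learnable affine map x -> W*x + b.
struct Affine <: Transformation
    W::Matrix{Float64}
    b::Vector{Float64}
end
transform(t::Affine, x::AbstractVector) = t.W * x .+ t.b

# A pipeline is itself a Transformation, so pipelines can be nested.
struct Chain <: Transformation
    steps::Vector{Transformation}
end
transform(c::Chain, x::AbstractVector) =
    foldl((acc, step) -> transform(step, acc), c.steps; init = x)

# An ensemble feeds one input to several transformations and aggregates
# their outputs into a single vector.
struct Ensemble <: Transformation
    members::Vector{Transformation}
    aggregate::Function
end
transform(e::Ensemble, x::AbstractVector) =
    e.aggregate([transform(m, x) for m in e.members])
```

For example, `Chain(Transformation[Affine(W, b), FunctionTransform(tanh)])` is one affine layer followed by an activation, and an `Ensemble` of such chains with an averaging `aggregate` behaves like any other single transformation.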

Add regularization to loss functions

It would be cool to be able to specify some function from https://github.com/JuliaML/PenaltyFunctions.jl to use as a regularization term in a loss function. This will make this setup able to capture most of the common loss functions more easily.

Pinball Loss

We should also implement the Pinball loss aka Quantile Loss. (see p. 44, example 2.43 in Steinwart 2008)

Steinwart, Ingo, and Andreas Christmann. Support vector machines. Springer Science & Business Media, 2008.
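
A minimal sketch of the pinball loss for a quantile level tau in (0, 1). It assumes the residual is defined as r = target - output, which is one common sign convention; the function name is hypothetical.

```julia
# Hypothetical stand-alone function. `r` is the residual target - output
# (an assumed sign convention) and `tau` is the quantile level in (0, 1).
# Under-predictions (r >= 0) are weighted by tau, over-predictions by 1 - tau.
pinball(tau::Real, r::Real) = r >= 0 ? tau * r : (tau - 1) * r
```

For tau = 0.5 this reduces to half the absolute error, so the median is recovered as a special case.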

Type stability of derivatives

There seems to be a lack of testing for the type stability of the deriv methods, which I thought I had covered. This gap allowed some of our implementations to be sub-optimal in terms of type inference.

For example, this function doesn't look type-stable for agreement::Int, because zero(T) and -one(T) are Int while the division returns a floating-point value:

function deriv{T<:Number}(loss::SmoothedL1HingeLoss, agreement::T)
    if agreement >= 1 - loss.gamma
        agreement >= 1 ? zero(T) : (agreement - one(T)) / loss.gamma
    else
        -one(T)
    end
end
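
One possible way to make the function above type-stable, sketched in current Julia syntax (`where` instead of the `deriv{T}` form). The struct is a stand-in that assumes a Float64 `gamma` field; the approach is to compute the result type of the division once and use it in every branch.

```julia
# Stand-in for the package type, assuming a Float64 `gamma` field.
struct SmoothedL1HingeLoss
    gamma::Float64
end

function deriv(loss::SmoothedL1HingeLoss, agreement::T) where {T<:Number}
    # R is the type produced by the division in the middle branch;
    # the constant branches are converted to it so all branches agree.
    R = typeof((agreement - one(T)) / loss.gamma)
    if agreement >= 1 - loss.gamma
        agreement >= 1 ? zero(R) : (agreement - one(T)) / loss.gamma
    else
        -one(R)
    end
end
```

With this change an Int agreement yields a Float64 in every branch, which is what a test using `@inferred` would need in order to pass.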

Potential conflict with `Base.LinAlg`

Is this anything to be worried about?

julia> module X
       using Losses
       using Base.LinAlg

       # errors because ambiguous
       issymmetric(L2DistLoss())
       end
WARNING: both LinAlg and Losses export "issymmetric"; uses of it in module X must be qualified
ERROR: UndefVarError: issymmetric not defined
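
One way to avoid this kind of clash is for the package to extend the existing generic function instead of defining its own. The sketch below uses the current `LinearAlgebra` standard library in place of `Base.LinAlg`, and `LossesSketch` is a hypothetical stand-in for the package module; it is meant to be run at top level.

```julia
module LossesSketch                     # hypothetical stand-in for the package
import LinearAlgebra: issymmetric      # extend the existing generic function
export L2DistLoss, issymmetric

struct L2DistLoss end
issymmetric(::L2DistLoss) = true        # adds a method, creates no new function
end

module X
using ..LossesSketch
using LinearAlgebra

# Both modules export the same function object, so there is no ambiguity.
sym_loss = issymmetric(L2DistLoss())
sym_mat = issymmetric([1 2; 2 1])
end
```

Since both exports refer to one function, user code can load the two modules together without qualifying the name.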

switch testing to BaseTestNext

This will be a pain in the ***, but I think worthwhile. The major problem with the current testing is that, on an error, the tests are halted, and so you can't see the results of all the tests. The output is also a little nicer, and things like your msg/msg2 can be handled nicely by creating nested contexts.

I'll take this on, if you approve of the switch.
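
For illustration, nested test contexts look roughly like this with the `Test` standard library of current Julia, where the testset functionality that BaseTestNext backported now lives. The `l2` function is a hypothetical stand-in for a real loss.

```julia
using Test

l2(difference) = abs2(difference)   # hypothetical loss used only for this example

@testset "L2 loss" begin
    # Each nested testset gets its own row in the summary, and a failing
    # @test does not stop the remaining tests from running.
    @testset "values" begin
        @test l2(0.0) == 0.0
        @test l2(-2.0) == 4.0
    end
    # The loop form creates one nested context per element type.
    @testset "types: $T" for T in (Float32, Float64)
        @test typeof(l2(one(T))) == T
    end
end
```

The nested names play the role of the msg/msg2 strings mentioned above.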
