
juliaml / MLLabelUtils.jl


Utility package for working with classification targets and label-encodings

Home Page: http://mllabelutilsjl.readthedocs.io/

License: Other

Julia 100.00%
classification julia machine-learning preprocessing

MLLabelUtils.jl's People

Contributors

carlolucibello, darsnack, evizero, femtocleaner[bot], github-actions[bot], holtri, johnnychen94, juliatagbot, juliohm, kristofferc, oxinabox, roberthoenig, staticfloat, tkelman


MLLabelUtils.jl's Issues

OOV / default value for NativeEncoding

This is related to #5

It would be good to have the ability to tell NativeEncoding what to do when it is asked to encode a value that is outside its key set.

This shows up, for example, in NLP, where it is common to replace any word not encountered during training with an "<UNK>" (unknown) or "<OOV>" (out-of-vocabulary) pseudo-token (which could be trained per #5, or otherwise) that is known to the encoding.

Right now I am working with pretrained word embeddings (https://github.com/JuliaText/Embeddings.jl),
and I was going to use convertlabel(Indices, words, ::NativeEncoding)
to convert all the words to the indices of their matching embeddings.
For my case I'd want to map any OOV word to a zero vector.

I'm not sure if this is best implemented by NativeEncoding,
or perhaps by a different type NativeEncodingWithDefault that would share a common supertype AbstractNativeEncoding,
or as an argument to some methods: convertlabel(Indices, words, ::NativeEncoding; default="<OOV>"). A rough sketch of the wrapper variant follows.
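
A minimal sketch of the wrapper idea (NativeLabelsWithDefault and label2ind_or_default are hypothetical names for illustration, not part of MLLabelUtils):

# Hypothetical sketch; none of these names exist in the package.
struct NativeLabelsWithDefault{T}
    invlabel::Dict{T,Int}   # label => index, as in NativeLabels
    default::Int            # index to return for out-of-vocabulary labels
end

label2ind_or_default(lbl, enc::NativeLabelsWithDefault) =
    get(enc.invlabel, lbl, enc.default)

enc = NativeLabelsWithDefault(Dict("the" => 1, "cat" => 2), 0)
label2ind_or_default("zyzzyva", enc)  # => 0, i.e. the OOV slot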

A rare-label encoding that combines all rare classes into one common label

In cases with a highly unbalanced class distribution,
some classes occur so rarely in the training data that it is better to ignore them.
In these cases, it would be useful to be able to set a threshold, and have all labels that occur less often than it mapped to a single label.

Here I will work with symbols for clarity.

origlbls = [:A, :A, :A, :B, :B, :B, :C, :D, :A]

encoding, newlbls = relabel_rares(origlbls, threshold = 3, rarelabel = :X) # signature for example only
@assert newlbls == [:A, :A, :A, :B, :B, :B, :X, :X, :A]

The information about which input labels are not rare should be stored in the encoding, so the mapping can be repeated.
Possibly it should also have a parameter to determine whether labels in the test data that were never seen before are an error, or just another rare label. (Possibly not, though.)

This encoding would be chained with other encodings.

It would also go well with a filtering method in MLDataUtils,
for when it is permissible to exclude these rare labels from training entirely.
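
A minimal sketch of the proposed behaviour (relabel_rares is the illustrative signature from above, not an existing API; countmap comes from StatsBase):

using StatsBase: countmap

function relabel_rares(lbls; threshold = 3, rarelabel = :X)
    counts = countmap(lbls)
    keep = Set(l for (l, c) in counts if c >= threshold)
    # storing the kept label set is enough to repeat the mapping on test data
    encoding = (keep = keep, rarelabel = rarelabel)
    newlbls = [l in keep ? l : rarelabel for l in lbls]
    encoding, newlbls
end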

Missing methods for converting between NativeLabels and Indices

As outlined by @oxinabox in #17

  • It should be possible to convert upwards in the number of labels, at least for integers (e.g. from an Indices{_,2} to a NativeLabels{_,3}).
enc = labelenc(["copper", "tin", "gold"])
convertlabel(enc,  [2, 1])
  • We should support single integers when the target encoding is unambiguous
convertlabel(enc, 3)
  • It would be useful to have the source encoding inferrable from just the type. I do seem to remember, though, that there was a good reason why I didn't do it in the first place
convertlabel(enc, 3, LabelEnc.Indices)

NativeLabels `label2ind` has overly tight type constraints

Consider the example:

julia> using MLLabelUtils

julia> train_raw = split("red green red blue red purple")
6-element Array{SubString{String},1}:
 "red"   
 "green" 
 "red"   
 "blue"  
 "red"   
 "purple"

julia> encoding = labelenc(train_raw)
MLLabelUtils.LabelEnc.NativeLabels{SubString{String},4}(SubString{String}["red","green","blue","purple"],Dict("purple"=>4,"blue"=>3,"green"=>2,"red"=>1))

julia> train_inds = label2ind.(train_raw, encoding)
6-element Array{Int64,1}:
 1
 2
 1
 3
 1
 4

julia> label2ind("green", encoding)
ERROR: MethodError: no method matching label2ind(::String, ::MLLabelUtils.LabelEnc.NativeLabels{SubString{String},4})
Closest candidates are:
  label2ind{T}(::T, ::MLLabelUtils.LabelEnc.NativeLabels{T,K}) at /home/ubuntu/.julia/v0.5/MLLabelUtils/src/labelencoding.jl:157
  label2ind{T}(::Union{Number,T}, ::MLLabelUtils.LabelEnc.Indices{T,K}) at /home/ubuntu/.julia/v0.5/MLLabelUtils/src/labelencoding.jl:158
  label2ind{T}(::Union{Number,T}, ::MLLabelUtils.LabelEnc.OneOfK{T,K}) at /home/ubuntu/.julia/v0.5/MLLabelUtils/src/labelencoding.jl:159
  ...

julia> label2ind(SubString("green",1), encoding)
2

Here I can't encode "green" because it is not a SubString.
But really there is a one-to-one correspondence between the two types,
and they can be used for dictionary lookups etc. and will match.

So I suggest that:

label2ind{T}(lbl::T, lm::LabelEnc.NativeLabels{T}) = Int(lm.invlabel[lbl])

be changed to:

label2ind(lbl, lm::LabelEnc.NativeLabels) = Int(lm.invlabel[lbl])

Then it can fail if it fails (Python style), but it will work if it works.
It is not like that constraint is doing anything useful for dispatch purposes anyway, no?

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

Update compat requirement for MappedArrays

Hello,

A new version of MappedArrays, 0.3.0, has been released. Would it be possible to update the [compat] section of the Project.toml file to include this version?

Best regards,

Why this package

First of all, sorry again that so many people are automatically subscribed to each new package, but I am not sure how to create new packages without this side effect.


Before my recent work-related Julia break I was in the middle of updating the label-compare logic of LossFunctions.jl and MLMetrics.jl, and was also thinking about how to go about label encoding in MLDataUtils.jl. Since then I have gotten some ideas that I am now trying to implement.

The main purpose of this package is the comparison between a predicted vector and a target vector (not full metrics, but simply the logic of how to compare, which will loosen the [-1,1] restriction of LossFunctions(!) and provide a basis for MLMetrics), as well as conversion between different representations (e.g. OneOfK, OneVsRest, etc.).

The first issue was where to put the classification-label related code:

  • MLDataUtils: Because that is a user-facing package with probably quite heavy dependencies down the road (e.g. Distributions, DataFrames), and LossFunctions/MLMetrics would need access to that code, MLDataUtils is a no-go.

  • LearnBase: An obvious candidate, and it will at least house the type and function definitions, but I suspect there will be quite a lot of code, which may be a bit too much for LearnBase.

  • LossFunctions: This was my most likely choice, but label-encoding conversion and label comparison seem too general to hide in there.

  • MLMetrics: That package will depend on LossFunctions, and since LossFunctions needs this code as well, this won't work.

Given those reasons, I decided to try outlining a separate package and see where it leads. It will have a LearnBase dependency, and probably MLDataUtils, LossFunctions, and MLMetrics will depend on it down the road to provide their classification-related functionality.

Support equality between identical encoders

It would be nice if equality were supported for encoders.

E.g., right now:

using MLLabelUtils

enc1 = labelenc(split("a b c a"))
enc2 = labelenc(split("a b c a"))

@show enc1==enc2
@show enc1.invlabel == enc2.invlabel
@show enc1.label == enc2.label

output:

enc1 == enc2 = false
enc1.invlabel == enc2.invlabel = true
enc1.label == enc2.label = true

For native encoders, equality of either of the fields implies equality of the whole; a sketch of the overload is below.
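
A minimal sketch of the suggested overload (not currently in the package; since either field determines the other, comparing the label field suffices):

Base.:(==)(a::LabelEnc.NativeLabels, b::LabelEnc.NativeLabels) = a.label == b.label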

NativeLabels should probably not be based around Int64

Currently, NativeLabels always works with the machine's Int, which will be Int64 on most systems used for ML.

I suggest that Int32 everywhere would be better.
Given that it is intended for encoding classes,
having more than typemax(Int32) == 2_147_483_647 classes
seems like an unreasonable scenario.
Even in NLP, when learning word-sense vectors, one is still looking at millions of classes -- at most.

Having billions of training cases, on the other hand, seems much more plausible.
And in that case, storing them as Int32 can halve your memory use.
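
A quick back-of-envelope check of that claim:

n = 10^9                    # a billion stored labels
n * sizeof(Int64) / 2^30    # ≈ 7.45 GiB
n * sizeof(Int32) / 2^30    # ≈ 3.73 GiB -- half the footprint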

I guess the downside is that if the user mixes them with integer literals,
type promotion will occur, which has a (single-operation) computational overhead and negates the storage benefit.

An argument could be made for using UInt32, but I think that would end up too fiddly.
An argument could also be made for the user to provide the type, with it defaulting to Int.
That sounds reasonable.

NativeLabels constructor for iterator

It would be nice to be able to call LabelEnc.NativeLabels(0:9) instead of the verbose LabelEnc.NativeLabels(collect(0:9)). Adding one or two outer constructors should allow for this.
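
A minimal sketch of such an outer constructor (assuming the existing constructor accepts a Vector, so the collect call terminates the recursion):

LabelEnc.NativeLabels(itr) = LabelEnc.NativeLabels(collect(itr))

LabelEnc.NativeLabels(0:9)  # now works without the explicit collect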

Package not in METADATA

It seems the package is not yet in METADATA. I tried installing it using Pkg.add("MLLabelUtils") but it errored out.

julia> Pkg.add("MLLabelUtils")
ERROR: unknown package MLLabelUtils
 in macro expansion at ./pkg/entry.jl:53 [inlined]
 in (::Base.Pkg.Entry.##2#5{String,Base.Pkg.Types.VersionSet})() at ./task.jl:360
 in sync_end() at ./task.jl:311
 in macro expansion at ./task.jl:327 [inlined]
 in add(::String, ::Base.Pkg.Types.VersionSet) at ./pkg/entry.jl:51
 in (::Base.Pkg.Dir.##2#3{Array{Any,1},Base.Pkg.Entry.#add,Tuple{String}})() at ./pkg/dir.jl:31
 in cd(::Base.Pkg.Dir.##2#3{Array{Any,1},Base.Pkg.Entry.#add,Tuple{String}}, ::String) at ./file.jl:59
 in #cd#1(::Array{Any,1}, ::Function, ::Function, ::String, ::Vararg{Any,N}) at ./pkg/dir.jl:31
 in add(::String) at ./pkg/pkg.jl:100
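
Until the package lands in METADATA, cloning it directly should work as a stopgap (the standard Pkg workflow on Julia 0.5):

julia> Pkg.clone("https://github.com/JuliaML/MLLabelUtils.jl.git")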

Labels for incomplete dataset

Hello,

Imagine that in my problem I have two classes, [:one, :two], which are very imbalanced. If my data are huge, it can happen that I load data containing only class :two. How can I construct the correct label encoding in this case?
For example, if I want a OneOfK encoding for a dataset with samples from only the :two class, the output should be

[0 1,
 0 1,
 0 1,
 0 1,
 ...
 0 1].

I was not able to figure out whether such a feature is supported.
Thanks for the answer.
Tomas
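
(For reference, one way to get this with the current API is to construct the encoding from the known label set rather than inferring it from the loaded data; a sketch using the documented convertlabel signature:)

using MLLabelUtils

enc = LabelEnc.NativeLabels([:one, :two])   # full label set, known a priori
targets = [:two, :two, :two]                # a batch containing only :two
convertlabel(LabelEnc.OneOfK, targets, enc)
# => 2×3 one-of-K matrix, one column per sample; the :one row stays all zero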

add onehot

I find the following methods convenient:

onehot(x) = onehot(Int, x)
onehot(T::Type, x) = convertlabel(LabelEnc.OneOfK{T}, x)
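
Assuming those definitions, usage would look like this (the source encoding being inferred from the values; by default observations end up as columns):

onehot([0, 1, 1, 0])  # 2×4 one-of-K matrix, one column per observation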

Worth the addition?

Suggestion: Restructure classify

Hi,

I have been trying to extend the classify interface to binary NativeLabels. While figuring out the internals of classify, I found some inconsistencies that make reasoning about the classify functionality difficult. I first summarize the inconsistencies, and then suggest a restructured API to address them.

Inconsistencies:

  • cutoff field in the LabelEnc type: LabelEnc.ZeroOne is currently the only LabelEnc with a cutoff field. First, a cutoff is generally also applicable to other label encodings like MarginBased. However, I would argue that cutoff is not an inherent feature of the LabelEnc, and should therefore be external to it.
    An example: computing an ROC curve for ZeroOne (or any other binary LabelEnc) requires computing the false positive rate and true positive rate at all possible cutoffs. With the current implementation, this would require modifying or creating a new LabelEnc, which I would argue does not make sense semantically.

This also leads to some special cases:

function classify(value::Number, lm::LabelEnc.ZeroOne{R}) where {R}
   R(classify(value, lm.cutoff))
end
  • Default classify without encoding: In the current implementation, classify without an encoding defaults to ZeroOne. To me this seems rather arbitrary, and I can't see the benefit of having this method overload. If there is some reason I do not see, maybe making things more explicit would help, i.e., replace
function classify(value::T, cutoff::Number) where {T<:Number}
     value >= cutoff ? T(1) : T(0)
end

with

classify(value::Number, cutoff::Number) = classify(value, cutoff, LabelEnc.ZeroOne)

or to not provide this method at all.

  • Type dispatching: If I understand this correctly, dispatching on a Type argument, e.g., classify(value::Number, ::Type{LabelEnc.ZeroOne}), acts like a default option for that LabelEnc. To me, these defaults may not always be intuitive, and for some encodings there cannot be a default, e.g., NativeLabels (what would be the poslabel in this case?). Also, the defaults may require more effort when reasoning about code.

  • Type inference on the value argument: Currently, when dispatching on the Type argument, the return type is chosen based on the type of the value argument. To me it is rather unintuitive why the type of value, which may be the result of some arbitrary scoring function, should determine the type of the classification result. Surprisingly, specifying the type for a LabelEnc does not work:

classify(0.6, LabelEnc.MarginBased{Float32}) # > MethodError
  • Function for broadcasting is semantically overloaded: There are two methods that take a vector as a first argument, but they mean different things: one is a broadcast helper, the other a single classify on a multivalued value vector. I currently don't see why one would need the broadcast helper function. Just provide the single-value classify, and leave the broadcasting to the caller.

    • broadcast: classify(values::AbstractVector{T}, cutoff::Number)
    • single classify on a multi-valued argument: classify(values::AbstractVector, ::Type{<:LabelEnc.OneOfK})
  • Minor naming convention: Why is lm the naming convention for label-encoding variables? Would lenc be more suitable?

Suggested API

Design principles for single value, binary classification:

  • The label encoding is independent of classification, i.e., there are no special cases for a cutoff field
  • In general, classify depends on an indicator function that decides whether a score value belongs to the positive class. It returns the poslabel/neglabel for the specified LabelEnc.
  • The linear cutoff, i.e., a numeric value to distinguish between the two classes, is a special case of this indicator function.
  • Return type of classify must be explicit, i.e., no type inference based on value argument.
  • Broadcasting is left to the caller

This would result in something like this:

function classify(value, cutoff::Function, lenc::BinaryLabelEncoding)
    cutoff(value) ? poslabel(lenc) : neglabel(lenc)
end

function classify(value, cutoff::Number, lenc::BinaryLabelEncoding)
    classify(value, x -> x > cutoff, lenc)
end

# some defaults (larger_than_cutoff below is a hypothetical helper, e.g. cutoff -> (x -> x >= cutoff))
classify(value::Number, lenc::LabelEnc.ZeroOne) = classify(value, larger_than_cutoff(lenc.cutoff), lenc)
classify(value::Number, lenc::LabelEnc.MarginBased) = classify(value, signbit, lenc)
classify(value::Number, lenc::LabelEnc.NativeLabels{T,2,F}) where {T,F} = classify(value, 0.5, lenc)

# If there should still be "defaults" for non-initialized LabelEnc
classify(value::Number, lenc::Type{LabelEnc.ZeroOne{T}}) where T = classify(value, 0.5, lenc())
classify(value::Number, lenc::Type{LabelEnc.MarginBased{T}}) where T = classify(value, x -> !signbit(x), lenc())

With some modification this also works for vector-based classification, like OneOfK.

I would appreciate feedback on this first, before I put more implementation effort into this.

missing methods for label2ind

I get this:

julia> x=[1 0; 0 1]
2×2 Array{Int64,2}:
 1  0
 0  1

julia> label2ind(x, LabelEnc.OneOfK(2))
ERROR: MethodError: no method matching isposlabel(::Array{Int64,2}, ::MLLabelUtils.LabelEnc.OneOfK{Int64,2})
Closest candidates are:
  isposlabel(::Bool, ::MLLabelUtils.LabelEnc.OneOfK) at /home/carlo/.julia/dev/MLLabelUtils/src/labelencoding.jl:264
  isposlabel(::Number, ::MLLabelUtils.LabelEnc.OneOfK{T,2}) where T at /home/carlo/.julia/dev/MLLabelUtils/src/labelencoding.jl:269
  isposlabel(::AbstractArray{#s5,1} where #s5<:Number, ::MLLabelUtils.LabelEnc.OneOfK{T,2}) where T at /home/carlo/.julia/dev/MLLabelUtils/src/labelencoding.jl:270
  ...
Stacktrace:
 [1] label2ind(::Array{Int64,2}, ::MLLabelUtils.LabelEnc.OneOfK{Int64,2}) at /home/carlo/.julia/dev/MLLabelUtils/src/labelencoding.jl:217
 [2] top-level scope at none:0

while I was expecting label2ind(x, LabelEnc.OneOfK(2)) == [1, 2]. Right?
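
In the meantime, a column-wise workaround along these lines should behave as expected, assuming the single-column vector method that the candidate list above hints at:

[label2ind(x[:, j], LabelEnc.OneOfK(2)) for j in 1:size(x, 2)]  # => [1, 2]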
