JuliaML / MLLabelUtils.jl

Utility package for working with classification targets and label-encodings
Home Page: http://mllabelutilsjl.readthedocs.io/
License: Other
This is related to #5
It would be good to have the ability to tell NativeEncoding what to do when it is asked to encode a value that is outside its key set.
This shows up, for example, in NLP, where it is common to replace any word not encountered during training with an "<UNK>" (unknown) or "<OOV>" (out of vocabulary) pseudo-token (which could be trained per #5, or otherwise) that is known to the encoding.
Right now I am working with pretrained word embeddings (https://github.com/JuliaText/Embeddings.jl), and I was going to use convertlabel(Indices, words, enc::NativeEncoding) to convert all the words to the indices of their matching embeddings. For my case I'd want to map any OOV word to a zero-vector.
I'm not sure if this is best implemented by NativeEncoding, or perhaps by a different type NativeEncodingWithDefault that would share a common super-type AbstractNativeEncoding, or as a keyword argument to some methods: convertlabel(Indices, words, enc::NativeEncoding; default="<OOV>")
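A minimal sketch of the default-on-missing behaviour, using a plain Dict as a stand-in for the encoding's invlabel lookup table (the name label2ind_default and the default-token handling are hypothetical, not part of MLLabelUtils):

```julia
# Stand-in for a NativeLabels-style inverse lookup table.
invlabel = Dict("<OOV>" => 1, "red" => 2, "green" => 3)

# Hypothetical helper: fall back to the index of a designated
# default label whenever the word is outside the key set.
label2ind_default(lbl, invlabel, default) =
    get(invlabel, lbl, invlabel[default])

label2ind_default("red", invlabel, "<OOV>")     # → 2 (known word)
label2ind_default("purple", invlabel, "<OOV>")  # → 1 (unknown word maps to <OOV>)
```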
In cases with a highly unbalanced class distribution, some classes occur so rarely in the training data that it is better to ignore them. In these cases, it would be useful to be able to set a threshold and have all labels that occur less often than it be mapped to a single label.
Here I will work with Symbols for clarity of example.
origlbls = [:A, :A, :A, :B, :B, :B, :C, :D, :A]
encoding, newlbls = relabel_rares(origlbls, threshold = 3, rarelabel = :X) # Signature for example only
@assert newlbls == [:A, :A, :A, :B, :B, :B, :X, :X, :A]
The data about which input labels are not rare should be stored in the encoding, so the mapping can be repeated. Possibly it should also have a parameter to determine whether labels never seen before in the test data are an error, or just another rare label. (Possibly not, though.)
This encoding would be chained with other encodings.
It would also go well with a filtering method in MLDataUtils, for when it is permissible to exclude these rare labels from the training entirely.
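A self-contained sketch of what relabel_rares might do, in plain Julia (the function name, signature, and the named-tuple stand-in for a real encoding type are all hypothetical, per the "signature for example only" note above):

```julia
# Hypothetical sketch of relabel_rares: map any label occurring
# fewer than `threshold` times to `rarelabel`.
function relabel_rares(lbls; threshold, rarelabel)
    # Count occurrences of each label.
    counts = Dict{eltype(lbls),Int}()
    for l in lbls
        counts[l] = get(counts, l, 0) + 1
    end
    # Labels at or above the threshold are kept as-is.
    keep = Set(l for (l, c) in counts if c >= threshold)
    # Stand-in for a real encoding type; stores what is needed
    # to repeat the mapping on new data.
    encoding = (keep = keep, rarelabel = rarelabel)
    newlbls = [l in keep ? l : rarelabel for l in lbls]
    return encoding, newlbls
end

origlbls = [:A, :A, :A, :B, :B, :B, :C, :D, :A]
encoding, newlbls = relabel_rares(origlbls, threshold = 3, rarelabel = :X)
newlbls == [:A, :A, :A, :B, :B, :B, :X, :X, :A]  # → true
```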
As outlined by @oxinabox in #17 (e.g. converting an Indices{_,2} to a NativeLabels{_,3}):
enc = labelenc(["copper", "tin", "gold"])
convertlabel(enc, [2, 1])
convertlabel(enc, 3)
convertlabel(enc, 3, LabelEnc.Indices)
Consider the example:
julia> using MLLabelUtils
julia> train_raw = split("red green red blue red purple")
6-element Array{SubString{String},1}:
"red"
"green"
"red"
"blue"
"red"
"purple"
julia> encoding = labelenc(train_raw)
MLLabelUtils.LabelEnc.NativeLabels{SubString{String},4}(SubString{String}["red","green","blue","purple"],Dict("purple"=>4,"blue"=>3,"green"=>2,"red"=>1))
julia> train_inds = label2ind.(train_raw, encoding)
6-element Array{Int64,1}:
1
2
1
3
1
4
julia> label2ind("green", encoding)
ERROR: MethodError: no method matching label2ind(::String, ::MLLabelUtils.LabelEnc.NativeLabels{SubString{String},4})
Closest candidates are:
label2ind{T}(::T, ::MLLabelUtils.LabelEnc.NativeLabels{T,K}) at /home/ubuntu/.julia/v0.5/MLLabelUtils/src/labelencoding.jl:157
label2ind{T}(::Union{Number,T}, ::MLLabelUtils.LabelEnc.Indices{T,K}) at /home/ubuntu/.julia/v0.5/MLLabelUtils/src/labelencoding.jl:158
label2ind{T}(::Union{Number,T}, ::MLLabelUtils.LabelEnc.OneOfK{T,K}) at /home/ubuntu/.julia/v0.5/MLLabelUtils/src/labelencoding.jl:159
...
julia> label2ind(SubString("green",1), encoding)
2
Here I can't encode "green" because it is not a SubString. But really there is a one-to-one correspondence between String and SubString: they can be used to do dictionary lookups etc. and will match.
So I suggest that:
MLLabelUtils.jl/src/labelencoding.jl
Line 157 in 54cafac
be changed to:
label2ind(lbl, lm::LabelEnc.NativeLabels) = Int(lm.invlabel[lbl])
Then it can fail if it fails (Python style), but it will work if it works.
It is not like that type constraint is doing anything useful for dispatch purposes anyway, no?
This issue is used to trigger TagBot; feel free to unsubscribe.
If you haven't already, you should update your TagBot.yml
to include issue comment triggers.
Please see this post on Discourse for instructions and more details.
Hello,
A new version of MappedArrays, 0.3.0, has been released. Is it possible to update the compat section of the Project.toml file to include this version?
Best regards,
First of all sorry again that so many people are automatically subscribed to each new package, but I am not sure how to create new packages without this side effect.
Before my recent work-related Julia break I was in the middle of updating the label-compare logic of LossFunctions.jl and MLMetrics.jl, and also thought about how to go about the label encoding in MLDataUtils.jl. Since then I got some ideas that I am now trying to implement.
The main purpose of this package is comparison between a predicted vector and a target vector (not full metrics, but simply the logic of how to compare, which will loosen the [-1,1] restriction of LossFunctions(!) and provide a basis for MLMetrics), as well as conversion between different representations (e.g. OneOfK, OneVsRest, etc.).
First issue was where to put the classification-label related code:
MLDataUtils: Because that is a user-facing package with probably quite heavy dependencies down the road (e.g. Distributions, DataFrames), and LossFunctions/MLMetrics would need access to that code, MLDataUtils is a no-go.
LearnBase: Obvious candidate, and will at least house the type and function definition, but I suspect there will be quite a lot of code that may be a bit too much for LearnBase
LossFunctions: This was my most likely choice, but label-encoding conversion and label comparison seem too general to hide in there.
MLMetrics: That package will depend on LossFunctions, and since LossFunctions needs this code as well this won't work
Given those reasons I decided to try outlining a separate package and see where it leads. It will have a LearnBase dependency, and probably MLDataUtils, LossFunctions, and MLMetrics will depend on it down the road to provide their classification-related functionality.
It would be nice if equality were supported for encoders.
E.g. right now:
using MLLabelUtils
enc1 = labelenc(split("a b c a"))
enc2 = labelenc(split("a b c a"))
@show enc1==enc2
@show enc1.invlabel == enc2.invlabel
@show enc1.label == enc2.label
output:
enc1 == enc2 = false
enc1.invlabel == enc2.invlabel = true
enc1.label == enc2.label = true
For native encoders equality on either of the fields implies equality of the whole.
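A self-contained illustration of why the default == fails and what a field-wise method would look like, using a hypothetical MyNativeLabels stand-in (the real NativeLabels fields label and invlabel are shown in the output above):

```julia
# Minimal stand-in for a NativeLabels-style encoder.
struct MyNativeLabels{T}
    label::Vector{T}
    invlabel::Dict{T,Int}
end
mylabelenc(xs) = (u = unique(xs); MyNativeLabels(u, Dict(l => i for (i, l) in enumerate(u))))

a = mylabelenc(["a", "b", "c"])
b = mylabelenc(["a", "b", "c"])
a == b  # → false: the fallback compares mutable fields by identity

# Field-wise equality; for these encoders either field
# determines the whole, so comparing `label` suffices.
Base.:(==)(x::MyNativeLabels, y::MyNativeLabels) = x.label == y.label
a == b  # → true
```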
Currently, NativeLabels will always work in the machine's Int, which will be Int64 on most systems being used for ML. I suggest that Int32 everywhere would be better.
Given that it is intended for encoding classes, having more than typemax(Int32) == 2_147_483_647 classes seems like an unreasonable scenario. Even in NLP, when learning word-sense vectors, one is still looking at millions of classes -- at most. Having billions of training cases, on the other hand, seems more reasonable, and in that case storing them as Int32 can halve your memory use.
I guess the downside is that if the user mixes them with integer literals, type promotion will occur, which has a (single-operation) computational overhead and negates the storage benefit.
An argument could be made for using UInt32, but I think that would end up too fiddly. An argument also could be made for letting the user provide the type, with it defaulting to Int.
That sounds reasonable.
It would be nice to be able to call LabelEnc.NativeLabels(0:9) instead of the verbose LabelEnc.NativeLabels(collect(0:9)). Adding one or two outer constructors should allow for this.
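The pattern is a one-liner; a sketch using a hypothetical Labels stand-in type rather than the real NativeLabels:

```julia
# Hypothetical stand-in for an encoding type whose inner
# constructor expects a Vector.
struct Labels{T}
    label::Vector{T}
end

# Outer constructor: accept any iterable and collect it into the
# Vector the inner constructor expects.
Labels(itr) = Labels(collect(itr))

Labels(0:9).label == collect(0:9)  # → true
```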
It seems the package is not yet in METADATA. I tried installing it using Pkg.add("MLLabelUtils"), but it errored out.
julia> Pkg.add("MLLabelUtils")
ERROR: unknown package MLLabelUtils
in macro expansion at ./pkg/entry.jl:53 [inlined]
in (::Base.Pkg.Entry.##2#5{String,Base.Pkg.Types.VersionSet})() at ./task.jl:360
in sync_end() at ./task.jl:311
in macro expansion at ./task.jl:327 [inlined]
in add(::String, ::Base.Pkg.Types.VersionSet) at ./pkg/entry.jl:51
in (::Base.Pkg.Dir.##2#3{Array{Any,1},Base.Pkg.Entry.#add,Tuple{String}})() at ./pkg/dir.jl:31
in cd(::Base.Pkg.Dir.##2#3{Array{Any,1},Base.Pkg.Entry.#add,Tuple{String}}, ::String) at ./file.jl:59
in #cd#1(::Array{Any,1}, ::Function, ::Function, ::String, ::Vararg{Any,N}) at ./pkg/dir.jl:31
in add(::String) at ./pkg/pkg.jl:100
Hello,
Imagine that in my problem I have two classes, [:one, :two], which are very imbalanced. If my data are huge, it can happen that I load data containing only class :two. How can I construct the correct label encoding in this case?
For example, if I want a OneOfK encoding for a dataset with samples from only the :two class, the output should be
[0 1,
0 1,
0 1,
0 1,
...
0 1].
I was not able to figure out whether such a feature is supported.
Thanks for the answer.
Tomas
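If I read the API right, one way this might work is to build the source encoding from the full, known class list instead of inferring it from the (possibly incomplete) data, and pass it to convertlabel explicitly; an untested sketch:

```julia
using MLLabelUtils

# A batch that happens to contain only one of the two classes.
ys = [:two, :two, :two]

# Construct the encoding from the full class list up front,
# rather than via labelenc(ys), which would only see :two.
enc = LabelEnc.NativeLabels([:one, :two])

# Each sample should then get the OneOfK vector for :two,
# even though :one never appears in this batch.
convertlabel(LabelEnc.OneOfK{Int}, ys, enc)
```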
I find the following methods convenient:
onehot(x) = onehot(Int, x)
onehot(T::Type, x) = convertlabel(LabelEnc.OneOfK{T}, x)
Worth the addition?
Hi,
I have been trying to extend the classify interface to binary NativeLabels. While figuring out the internals of classify, I have found some inconsistencies that make reasoning about the classify functionality difficult. I first summarize the inconsistencies, and then suggest a restructured API to address them.
cutoff field in the LabelEnc type: LabelEnc.ZeroOne currently is the only LabelEnc with a cutoff field. First, a cutoff generally is also applicable to other LabelEncodings like MarginBased. However, I would argue that a cutoff is not an inherent feature of the LabelEnc, and should therefore be external to it. This also leads to some special cases:
function classify(value::Number, lm::LabelEnc.ZeroOne{R}) where {R}
R(classify(value, lm.cutoff))
end
classify without an encoding defaults to ZeroOne. To me this seems rather arbitrary, and I can't see the benefit of having this method overload. If there is some reason I do not see, maybe making things more explicit would help, i.e., replace
function classify(value::T, cutoff::Number) where {T<:Number}
value >= cutoff ? T(1) : T(0)
end
with
classify(value::Number, cutoff::Number) = classify(value, cutoff, LabelEnc.ZeroOne)
or to not provide this method at all.
Type dispatching: If I understand this correctly, dispatching on a Type argument, e.g., classify(value::Number, ::Type{LabelEnc.ZeroOne})
, is like a default option for that LabelEnc. To me, these defaults may not always be intuitive, and for some encodings there cannot be a default, e.g., NativeLabels (what would be the poslabel in this case?). Also, the defaults may require more effort when reasoning about code.
Type inference on the value argument: Currently, when dispatching on the Type argument, the return type is chosen based on the type of the value argument. To me it is rather unintuitive why the type of value, which may be the result of some arbitrary scoring function, should determine the type of the classification result. Surprisingly, specifying the type for a LabelEnc does not work:
classify(0.6, LabelEnc.MarginBased{Float32}) # > MethodError
Function for broadcasting is semantically overloaded: There are two methods that take a vector as a first argument, but they mean different things: one is a broadcast helper, the other a single classify on a multivalued value vector. I currently don't see why one would need the broadcast helper function. Just provide the single-value classify, and leave the broadcasting to the caller.
classify(values::AbstractVector{T}, cutoff::Number)
classify(values::AbstractVector, ::Type{<:LabelEnc.OneOfK})
Minor naming convention: Why is lm the naming convention for LabelEncoding variables? Would lenc be more suitable?
Design principles for single-value, binary classification: the decision should be driven by the value argument. This would result in something like this:
function classify(value, cutoff::Function, lenc::BinaryLabelEncoding)
cutoff(value) ? poslabel(lenc) : neglabel(lenc)
end
function classify(value, cutoff::Number, lenc::BinaryLabelEncoding)
classify(value, x -> x > cutoff, lenc)
end
# some defaults
classify(value::Number, lenc::LabelEnc.ZeroOne) = classify(value, larger_than_cutoff(lenc.cutoff), lenc)
classify(value::Number, lenc::LabelEnc.MarginBased) = classify(value, signbit, lenc)
classify(value::Number, lenc::LabelEnc.NativeLabels{T,2,F}) where {T,F} = classify(value, 0.5, lenc)
# If there should still be "defaults" for non-initialized LabelEnc
classify(value::Number, lenc::Type{LabelEnc.ZeroOne{T}}) where T = classify(value, 0.5, lenc())
classify(value::Number, lenc::Type{LabelEnc.MarginBased{T}}) where T = classify(value, x -> !signbit(x), lenc())
With some modification this also works for vector-based classification, like OneOfK.
I would appreciate feedback on this first, before I put more implementation effort into this.
I get this
julia> x=[1 0; 0 1]
2×2 Array{Int64,2}:
1 0
0 1
julia> label2ind(x, LabelEnc.OneOfK(2))
ERROR: MethodError: no method matching isposlabel(::Array{Int64,2}, ::MLLabelUtils.LabelEnc.OneOfK{Int64,2})
Closest candidates are:
isposlabel(::Bool, ::MLLabelUtils.LabelEnc.OneOfK) at /home/carlo/.julia/dev/MLLabelUtils/src/labelencoding.jl:264
isposlabel(::Number, ::MLLabelUtils.LabelEnc.OneOfK{T,2}) where T at /home/carlo/.julia/dev/MLLabelUtils/src/labelencoding.jl:269
isposlabel(::AbstractArray{#s5,1} where #s5<:Number, ::MLLabelUtils.LabelEnc.OneOfK{T,2}) where T at /home/carlo/.julia/dev/MLLabelUtils/src/labelencoding.jl:270
...
Stacktrace:
[1] label2ind(::Array{Int64,2}, ::MLLabelUtils.LabelEnc.OneOfK{Int64,2}) at /home/carlo/.julia/dev/MLLabelUtils/src/labelencoding.jl:217
[2] top-level scope at none:0
while I was expecting label2ind(x, LabelEnc.OneOfK(2)) == [1, 2], right?
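Since OneOfK decoding amounts to taking the position of the maximum in each one-hot vector, the expected [1, 2] can be recovered by decoding column by column; a plain-Julia workaround (assuming one observation per column, not MLLabelUtils API):

```julia
x = [1 0; 0 1]

# Decode each column by taking the index of its maximum entry,
# which is what OneOfK decoding of a one-hot vector amounts to.
inds = [argmax(x[:, j]) for j in 1:size(x, 2)]  # → [1, 2]
```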