
distances.jl's People

Contributors

alyst, andreasnoack, ararslan, aviatesk, devmotion, dkarrasch, femtocleaner[bot], iainnz, isakfalk, jlapeyre, johnmyleswhite, johnnychen94, juliohm, kescobo, kristofferc, lindahua, mcfefa, mkborregaard, nalimilan, ntdef, pauljurczak, rashidrafeek, rawls238, richardreeve, rmcaixeta, stefankarpinski, suvarzz, timholy, tkelman, torfjelde

distances.jl's Issues

Error in missing

Something like this used to work in 0.6, but now fails.

julia> a = Any[0,0,0,1]
4-element Array{Any,1}:
 0
 0
 0
 1

julia> b = Any[1,0,0,1]
4-element Array{Any,1}:
 1
 0
 0
 1

julia> hamming(a, b)
ERROR: MethodError: no method matching one(::Type{Any})
Closest candidates are:
  one(::Type{Union{Missing, T}}) where T at missing.jl:83
  one(::Missing) at missing.jl:79
  one(::BitArray{2}) at bitarray.jl:392
  ...
Stacktrace:
 [1] one(::Type{Any}) at ./missing.jl:83
 [2] result_type(::Hamming, ::Array{Any,1}, ::Array{Any,1}) at /Users/adrian/.julia/dev/Distances/src/metrics.jl:194
 [3] eval_start(::Hamming, ::Array{Any,1}, ::Array{Any,1}) at /Users/adrian/.julia/dev/Distances/src/metrics.jl:196
 [4] evaluate at /Users/adrian/.julia/dev/Distances/src/metrics.jl:159 [inlined]
 [5] hamming(::Array{Any,1}, ::Array{Any,1}) at /Users/adrian/.julia/dev/Distances/src/metrics.jl:240
 [6] top-level scope at none:0

Per this thread: https://discourse.julialang.org/t/status-of-distances-jl/16789
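
As a possible stopgap while `hamming` rejects `Any` vectors, mismatches can be counted directly. This is only a sketch: `hamming_count` is a hypothetical helper, not part of Distances.jl, and it avoids `one(eltype(a))` altogether.

```julia
# Hypothetical helper (not part of Distances.jl): count mismatching positions
# without calling one(eltype(a)), so the element type Any is acceptable.
function hamming_count(a::AbstractVector, b::AbstractVector)
    length(a) == length(b) ||
        throw(DimensionMismatch("vectors must have the same length"))
    return count(((x, y),) -> x != y, zip(a, b))
end

a = Any[0, 0, 0, 1]
b = Any[1, 0, 0, 1]
hamming_count(a, b)  # 1
```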

Feature Request: evaluate pairwise distances on subset of columns indexed by a vector

Hi,

I have a "large" dataset and I would like to calculate the pairwise distances only on a subset of the columns, indexed by a vector of Ints. The obvious thing to do is to code up the double loop explicitly myself, but I thought it might be a useful feature to have in the package itself.

How hard would it be to implement it throughout the package?
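
For reference, the explicit double loop mentioned above might look roughly like this. It is a sketch only: `pairwise_subset` is a hypothetical name, and the Euclidean distance is hard-coded for brevity rather than taken from the package.

```julia
# Hypothetical helper: pairwise Euclidean distances restricted to selected columns.
function pairwise_subset(X::AbstractMatrix, cols::AbstractVector{<:Integer})
    n = length(cols)
    R = zeros(float(eltype(X)), n, n)
    for (jj, j) in enumerate(cols), (ii, i) in enumerate(cols)
        ii > jj || continue              # compute each unordered pair once
        s = zero(eltype(R))
        for k in axes(X, 1)
            d = X[k, i] - X[k, j]
            s += d * d
        end
        R[ii, jj] = R[jj, ii] = sqrt(s)  # mirror by symmetry
    end
    return R
end

X = [0.0 3.0 6.0;
     0.0 4.0 8.0]
R = pairwise_subset(X, [1, 3])  # 2×2 matrix; off-diagonal entries are 10.0
```

No column is copied: the loop indexes into `X` directly, so the only allocation is the result matrix.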

Add a generic distributed pairwise(!) function?

I was wondering if something along the following lines would be of general interest:

function distributedpairwise!(r::SharedMatrix, metric::SemiMetric, a::AbstractMatrix)
    n = size(a, 2)
    size(r) == (n, n) || throw(DimensionMismatch("Incorrect size of r."))
    @inbounds for j = 1:n
        aj = view(a, :, j)
        @sync @distributed for i = (j + 1):n # have parallel computation here to avoid uneven work load
            r[i, j] = evaluate(metric, view(a, :, i), aj)
        end
        r[j, j] = 0
        for i = 1:(j - 1)
            r[i, j] = r[j, i]   # leveraging the symmetry of SemiMetric
        end
    end
    r
end

and similarly for pairwise(::SemiMetric, a, b)? Or, the other way around: is there a performance loss to be expected if the above were to replace the current pairwise! implementation? It would work seamlessly even if only one worker was available. Note that even the latter would not affect specialized pairwise instances such as the one for Euclidean().
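
For comparison, a shared-memory variant of the same idea could use threads instead of distributed workers. This is a sketch under assumptions: `threaded_pairwise!` is a hypothetical name, the metric is passed as a plain function `f(x, y)`, and the outer loop is parallelised, so the work per iteration is uneven (which the version above avoids by parallelising the inner loop).

```julia
using Base.Threads

# Hypothetical helper: fill r with f applied to every pair of columns of a.
# Each thread owns whole values of j and writes only r[i, j] and r[j, i]
# for i > j, so no two threads write to the same entry.
function threaded_pairwise!(r::AbstractMatrix, f, a::AbstractMatrix)
    n = size(a, 2)
    size(r) == (n, n) || throw(DimensionMismatch("Incorrect size of r."))
    @threads for j in 1:n
        aj = view(a, :, j)
        r[j, j] = 0
        for i in (j + 1):n
            v = f(view(a, :, i), aj)
            r[i, j] = v
            r[j, i] = v   # symmetry of a semimetric
        end
    end
    return r
end

a = [0.0 3.0;
     0.0 4.0]
r = threaded_pairwise!(zeros(2, 2), (x, y) -> sqrt(sum(abs2, x .- y)), a)
```

With a single thread this degrades to the ordinary serial loop.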

RFC: Remove dependency on ArrayViews and use slice instead.

This could potentially fix #5 since slices work with sparse matrices.

From ArrayViews.jl

Reasons to prefer ArrayViews:

Construction of SubArrays is frequently (but not always) 2-4 times slower than construction of views. If you are constructing many column views, ArrayViews may still be the better choice.

However, the construction times of the views are not a bottleneck here, so there seems to me to be no reason to keep the ArrayViews dependency.

Hausdorff distance

I want to make the Hausdorff distance available in Julia. Is the Distances.jl package a good fit?

I can think of a very simple (naive) implementation that works surprisingly well:

using Distances

A, B # matrices whose columns represent the points in each point set
D = pairwise(Euclidean(), A, B)
daB = maximum(minimum(D,2))
dbA = maximum(minimum(D,1))
result = max(daB, dbA)
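
On current Julia versions the reductions take a `dims` keyword rather than a positional dimension. A self-contained sketch of the same naive approach, with a hypothetical `hausdorff` helper and the distance matrix built by hand instead of via `pairwise`:

```julia
using LinearAlgebra

# Hypothetical helper: Hausdorff distance between two point sets stored column-wise.
function hausdorff(A::AbstractMatrix, B::AbstractMatrix)
    D = [norm(view(A, :, i) - view(B, :, j)) for i in axes(A, 2), j in axes(B, 2)]
    dAB = maximum(minimum(D, dims=2))  # farthest point of A from its nearest point in B
    dBA = maximum(minimum(D, dims=1))  # farthest point of B from its nearest point in A
    return max(dAB, dBA)
end

A = [0.0 1.0;
     0.0 0.0]                   # points (0,0) and (1,0)
B = reshape([0.0, 0.0], 2, 1)   # single point (0,0)
hausdorff(A, B)  # 1.0
```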

Implement Haversine

Haversine is a very useful distance when dealing with geospatial data. It is implemented in libraries such as scikit-learn, and it's pretty standard today.
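
For concreteness, the haversine formula itself is short. The sketch below is not the package's implementation: `haversine_distance` is a hypothetical name, and the (longitude, latitude) ordering in degrees is an assumption made for the example.

```julia
# Hypothetical helper: great-circle distance on a sphere of the given radius.
# Points are (longitude, latitude) pairs in degrees; this ordering is an assumption.
function haversine_distance(p, q, radius)
    λ1, φ1 = deg2rad(p[1]), deg2rad(p[2])
    λ2, φ2 = deg2rad(q[1]), deg2rad(q[2])
    h = sin((φ2 - φ1) / 2)^2 + cos(φ1) * cos(φ2) * sin((λ2 - λ1) / 2)^2
    return 2 * radius * asin(min(sqrt(h), one(h)))  # clamp against rounding above 1
end

# Equator to pole: a quarter of the circumference for a radius of 6371 km.
d = haversine_distance((0.0, 0.0), (0.0, 90.0), 6371.0)
```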

Distance.jl and Distances.jl

Could you please confirm that Distances.jl is deprecated and that we should use Distance.jl instead?

Please delete one of these repos to avoid confusion.

Please tag latest version in METADATA

I am maintaining GaussianMixtures, which depends on Clustering, which depends on Distances. Because julia-0.5 warnings are extremely slow, my GaussianMixtures tests never end because of a deprecation warning that has already been solved in #48.

This problem would be solved if Distances tags 152d6cc as a new minor version in METADATA. Then I can get GaussianMixtures up to date with julia v0.5.

Thanks.

colwise ridiculously slow for small columns due to allocating array views

A quite common need is to have one (or two) data matrices, and to compute distances between certain columns. Using Distances.jl is nice, because it supplies an API, so that users can plug different distance functions into an algorithm. As an example, consider NearestNeighbors.jl.

Hence, one needs an API to compute evaluate(dist, X[:,i1], Y[:,i2]).

Unfortunately, Distances.jl uses array views. Array views are harmful, because they allocate (and add indirection!). Until this is fixed, one must provide a zero-overhead way of computing such distances.

A simple API would be evaluate(dist, X, i1, Y, i2). Due to limitations of julia, we cannot bundle the underlying matrix and the indices into a struct or tuple (this is what array views do). This constraint is unfortunate, but I don't think there is any way to have a nice and fast API. As far as I understood, this will not change in julia 1.0; hence an ugly-but-fast API is necessary (if you consider allocating array views a bug, then a workaround is needed).
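
A minimal sketch of what such an index-based method could look like, with the Euclidean distance written out by hand. The name `euclidean_cols` is hypothetical, and nothing here is part of Distances.jl.

```julia
# Hypothetical index-based evaluation: distance between column i1 of X and
# column i2 of Y, reading the matrices directly instead of constructing views.
function euclidean_cols(X::AbstractMatrix, i1::Integer, Y::AbstractMatrix, i2::Integer)
    size(X, 1) == size(Y, 1) || throw(DimensionMismatch("row counts differ"))
    s = zero(promote_type(eltype(X), eltype(Y)))
    for k in axes(X, 1)
        d = X[k, i1] - Y[k, i2]
        s += d * d
    end
    return sqrt(s)
end

X = [0.0 3.0;
     0.0 4.0]
euclidean_cols(X, 1, X, 2)  # 5.0
```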

I am not sure about the final API design; hence no pull request. However, I expect that a lot of packages downstream will want to make use of this.

Indeed, the broken API is used even internally in colwise.

I attached an example benchmark, comparing colwise against a naive loop for euclidean distances between 10-dimensional points. Lower dimensional data has even worse relative speed differences; higher dimensional data reduces the difference.

using Distances
using BenchmarkTools

m = 10
n=100_000
Xcol = rand(m,n)
Ycol=rand(m,n)


function colwise_naive(X,Y)
    m = size(X,1)
    n = size(X,2)
    T = eltype(X)
    assert(m == size(Y,1) && n == size(Y,2) && T==eltype(Y))

    res = Vector{T}(n)
    @inbounds  for j = 1:n
        s_ = zero(T)
        @inbounds @simd for i = 1:m
            dd_ = X[i,j]-Y[i,j]
            s_ += dd_*dd_
        end
        res[j] = sqrt(s_)
    end
    res
end

dd = Euclidean()
bench_colw = @benchmark colwise(dd, Xcol,Ycol)
bench_colw2 = @benchmark colwise_naive(Xcol,Ycol)

println("distances.jl colwise")
show(STDOUT, MIME"text/plain"(), bench_colw)

println("\nnaive colwise")
show(STDOUT, MIME"text/plain"(), bench_colw2)

yielding

colwise
BenchmarkTools.Trial: 
  memory estimate:  9.92 MiB
  allocs estimate:  200002
  --------------
  minimum time:     5.914 ms (0.00% GC)
  median time:      6.522 ms (0.00% GC)
  mean time:        6.724 ms (4.03% GC)
  maximum time:     15.262 ms (0.00% GC)
  --------------
  samples:          742
  evals/sample:     1
naive colwise
BenchmarkTools.Trial: 
  memory estimate:  781.33 KiB
  allocs estimate:  2
  --------------
  minimum time:     1.636 ms (0.00% GC)
  median time:      1.726 ms (0.00% GC)
  mean time:        1.932 ms (0.46% GC)
  maximum time:     8.783 ms (0.00% GC)
  --------------
  samples:          2573
  evals/sample:     1

PS. The combination of evaluate(dist, X, i1, Y, i2) and evaluate(dist, X, Y) [for vectors] works when the distance evaluation is inlined. If it is not inlined, then one should consider an API evaluate(dist, Xptr, Yptr, len), in order to avoid unnecessary indirection for Arrays and data copying for SVectors.

[PkgEval] Distances may have a testing issue on Julia 0.4 (2015-06-23)

PackageEvaluator.jl is a script that runs nightly. It attempts to load all Julia packages and run their tests (if available) on both the stable version of Julia (0.3) and the nightly build of the unstable version (0.4). The results of this script are used to generate a package listing enhanced with testing results.

On Julia 0.4

  • On 2015-06-22 the testing status was Tests pass.
  • On 2015-06-23 the testing status changed to Tests fail.

This issue was filed because your testing status became worse. No additional issues will be filed if your package remains in this state, and no issue will be filed if it improves. If you'd like to opt-out of these status-change messages, reply to this message saying you'd like to and @IainNZ will add an exception. If you'd like to discuss PackageEvaluator.jl please file an issue at the repository. For example, your package may be untestable on the test machine due to a dependency - an exception can be added.

Test log:

>>> 'Pkg.add("Distances")' log
INFO: Installing ArrayViews v0.6.2
INFO: Installing Distances v0.2.0
INFO: Package database updated

>>> 'Pkg.test("Distances")' log
INFO: Testing Distances
Running tests ...
* test_dists.jl ...
ERROR: LoadError: LoadError: syntax: local declaration in global scope
 in include at ./boot.jl:254
 in include_from_node1 at ./loading.jl:133
 in anonymous at no file:8
 in include at ./boot.jl:254
 in include_from_node1 at loading.jl:133
 in process_options at ./client.jl:304
 in _start at ./client.jl:404
while loading /home/vagrant/.julia/v0.4/Distances/test/test_dists.jl, in expression starting on line 175
while loading /home/vagrant/.julia/v0.4/Distances/test/runtests.jl, in expression starting on line 5

==============================[ ERROR: Distances ]==============================

failed process: Process(`/home/vagrant/julia/bin/julia --check-bounds=yes --code-coverage=none --color=no /home/vagrant/.julia/v0.4/Distances/test/runtests.jl`, ProcessExited(1)) [1]

================================================================================
INFO: No packages to install, update or remove
ERROR: Distances had test errors
 in error at ./error.jl:21
 in test at pkg/entry.jl:746
 in anonymous at pkg/dir.jl:31
 in cd at file.jl:22
 in cd at pkg/dir.jl:31
 in test at pkg.jl:71
 in process_options at ./client.jl:280
 in _start at ./client.jl:404


>>> End of log

Regression of colwise on Julia 0.7?

I was testing some code of mine in Julia 0.7, and found a rather drastic regression.

On Julia 0.7:

julia> using Distances, BenchmarkTools

julia> A = rand(2,41); B = rand(2,41);

julia> @benchmark colwise($(Euclidean()), $A, $B)
BenchmarkTools.Trial: 
  memory estimate:  4.28 KiB
  allocs estimate:  83
  --------------
  minimum time:     634.094 ns (0.00% GC)
  median time:      671.082 ns (0.00% GC)
  mean time:        786.797 ns (12.98% GC)
  maximum time:     176.285 μs (99.54% GC)
  --------------
  samples:          10000
  evals/sample:     170

julia> d = fill(0., 41);

julia> @benchmark colwise!($d,$(Euclidean()), $A, $B)
BenchmarkTools.Trial: 
  memory estimate:  3.84 KiB
  allocs estimate:  82
  --------------
  minimum time:     567.255 ns (0.00% GC)
  median time:      610.462 ns (0.00% GC)
  mean time:        753.263 ns (16.55% GC)
  maximum time:     175.393 μs (99.61% GC)
  --------------
  samples:          10000
  evals/sample:     184

julia> versioninfo()
Julia Version 0.7.0-beta2.0
Commit b145832402* (2018-07-13 19:54 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)

(note the allocations even for colwise!), and for comparison

julia> @benchmark colwise($(Euclidean()), $A, $B)
BenchmarkTools.Trial:
  memory estimate:  448 bytes
  allocs estimate:  1
  --------------
  minimum time:     246.176 ns (0.00% GC)
  median time:      251.361 ns (0.00% GC)
  mean time:        278.406 ns (2.41% GC)
  maximum time:     2.926 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     386

julia> d = zeros(41);

julia> @benchmark colwise!($d,$(Euclidean()), $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     202.838 ns (0.00% GC)
  median time:      204.926 ns (0.00% GC)
  mean time:        216.499 ns (0.00% GC)
  maximum time:     1.503 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     580

julia> versioninfo()
Julia Version 0.6.4
Commit 9d11f62bcb (2018-07-09 19:09 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=16)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, broadwell)

Document clearly where numeric consistency with naive implementations is violated

I think it is worth documenting that the optimisation done by Distances.jl
can cause disagreement with naive implementations.

Possibly with some advice to not mix them.

E.g.

srand(1)
x= rand(50)
y= rand(50)

euclidean(x,y) - sqrt(sum((x .- y).^2))

result is `-4.440892098500626e-16`
rather than `0.00`

On the other hand, perhaps one should never assume floats are equal.
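
The underlying effect can be reproduced without the package: floating-point addition is not associative, so any reordering of a sum (which an optimised implementation is free to do) can change the last bits. A small illustration, comparing with `isapprox` (`≈`) instead of `==`:

```julia
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)

a == b   # false: the two groupings round differently
a ≈ b    # true: isapprox tolerates rounding-level differences
```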

cosine distance formula

In your documentation, you mention that the cosine distance is calculated as:

1 - dot(x, y) / (norm(x) * norm(y))

However, I think that the correct expression would be

vecdot(x, y) / (norm(x) * norm(y))

Is it accurate?

Thanks
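
For what it's worth, the two expressions are usually given different names: the ratio on its own is the cosine similarity, and one minus it is the cosine distance. A sketch with hypothetical helper names:

```julia
using LinearAlgebra

# Hypothetical helpers illustrating the usual naming convention.
cosine_similarity(x, y) = dot(x, y) / (norm(x) * norm(y))
cosine_distance(x, y) = 1 - cosine_similarity(x, y)

cosine_distance([1.0, 0.0], [0.0, 1.0])  # 1.0 for orthogonal vectors
cosine_distance([1.0, 2.0], [2.0, 4.0])  # approximately 0 for parallel vectors
```

Under this convention identical directions give a distance near zero, which is what one expects from something called a distance.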

Ellipsoid distance

The Mahalanobis distance is very useful when we have a covariance/precision matrix to weight the observations. In many other applications where a covariance doesn't exist, however, we still have the opportunity to introduce a geometrical view by constructing a matrix based on rotation and scaling of the system axes.

The Ellipsoid distance I implemented in GeoStats.jl does exactly this: https://github.com/juliohm/GeoStats.jl/blob/master/src/distances.jl#L38-L105

It is basically a helper constructor that given directions and scaling coefficients, builds the corresponding Mahalanobis matrix. Can I submit a PR that adds this distance to the package? My plan is to just reuse the code for Mahalanobis. The new struct Ellipsoid would just contain a Mahalanobis that is initialized properly.

Please let me know if that is welcome, and I will work on it during the week.

Implement Bray-Curtis dissimilarity?

Bray-Curtis is a semi-metric often used in ecology.

The implementation for two vectors i and j is pretty straightforward:

function braycurtis(i::AbstractVector, j::AbstractVector)
    length(i) == length(j) || error("Vectors must be same length")
    return sum(abs.(i .- j))/sum(i .+ j)
end

I was looking around in the source code, and I'd be happy to try to submit a PR but I'm not entirely confident I understand what's going on with all of the @inbounds and @eval stuff...

Should Mahalanobis invert its matrix argument (qmat)?

This is more of a question about the reasoning behind this decision than an actual issue. Every definition I've seen for Mahalanobis distance is defined using the covariance matrix S like:

d = sqrt( (x-m)' * inv(S) * (x-m) )

Here, I think of d as a function of x, m, and S (as does R, apparently). However, the current implementation in this package treats it as a function of x, m, and qmat = inv(S). This is, of course, just a matter of perspective, but I found it unintuitive at first.
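
The two perspectives can be put side by side in a few lines. The function names below are hypothetical and only illustrate that passing `S` and solving gives the same value as passing `qmat = inv(S)` and multiplying:

```julia
using LinearAlgebra

# Hypothetical helpers: same distance, parameterised two ways.
mahalanobis_cov(x, m, S) = sqrt(dot(x - m, S \ (x - m)))    # takes the covariance S
mahalanobis_prec(x, m, Q) = sqrt(dot(x - m, Q * (x - m)))   # takes Q = inv(S)

S = [2.0 0.0;
     0.0 0.5]
x, m = [2.0, 1.0], [0.0, 0.0]

mahalanobis_cov(x, m, S)        # 2.0
mahalanobis_prec(x, m, inv(S))  # 2.0
```

One possible reason to prefer the `qmat` form is that the inverse is computed once by the caller rather than on every evaluation, though that is a guess about the design rationale.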

Need for custom floating point precision due to memory constraints

I am making use of pairwise(Euclidean(), X, Y) with size(X,2) = size(Y,2) ~= 50000. It seems that the current implementation always sets the result_type to Float64: https://github.com/JuliaStats/Distances.jl/blob/master/src/generic.jl#L27

My workstation has 8GB of RAM and therefore I cannot proceed with such precision. Could you please consider adding an option for custom result types? Float16 would solve the problem.

Thank you.
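
To put numbers on the constraint, the result matrix alone needs n² entries. A quick estimate for n = 50000 (the helper name is just for illustration):

```julia
# Bytes needed for an n×n result matrix with element type T.
result_bytes(n, T) = n^2 * sizeof(T)

n = 50_000
for T in (Float64, Float32, Float16)
    println(T, ": ", round(result_bytes(n, T) / 2^30, digits=2), " GiB")
end
```

That is 2×10¹⁰ bytes for Float64, 10¹⁰ for Float32 and 5×10⁹ for Float16, so only the Float16 result fits in 8 GB.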

pairwise CosineDistance with Int matrices as argument throws error

Here's a quick example:

using Distances
A = rand(1:100, (100, 100));
pairwise(CosineDist(), A)

ERROR: InexactError: Int64(Int64, 595.1344385934997)
Stacktrace:
 [1] Type at ./float.jl:700 [inlined]
 [2] convert at ./number.jl:7 [inlined]
 [3] setindex! at ./array.jl:769 [inlined]
 [4] macro expansion at /Users/isaac/.julia/packages/Distances/nLAdT/src/common.jl:96 [inlined]
 [5] macro expansion at ./simdloop.jl:73 [inlined]
 [6] sqrt!(::Array{Int64,1}) at /Users/isaac/.julia/packages/Distances/nLAdT/src/common.jl:95
 [7] pairwise!(::Array{Float64,2}, ::CosineDist, ::Array{Int64,2}) at /Users/isaac/.julia/packages/Distances/nLAdT/src/metrics.jl:596
 [8] pairwise(::CosineDist, ::Array{Int64,2}) at /Users/isaac/.julia/packages/Distances/nLAdT/src/generic.jl:127
 [9] top-level scope at none:0

It looks like this is happening because of this line, since sumsq_percol returns an integer vector for an integer matrix. I did a little benchmarking to see the cost of using sqrt. instead of the sqrt! version, and don't see much difference.
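
The failure mode can be reproduced in isolation: writing a non-integer square root back into an integer vector throws, while broadcasting `sqrt` allocates a new floating-point vector.

```julia
v = [4, 9, 2]              # Vector{Int}

failed = try
    v[3] = sqrt(v[3])      # sqrt(2) cannot be stored as an Int
    false
catch err
    err isa InexactError
end

w = sqrt.(v)               # new Vector{Float64}; v is left untouched
```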

Would it make sense to move StatsBase's distances/divergences here?

Currently StatsBase defines the L2, squared L2, L1, and L-inf distances, as well as the generalized Kullback-Leibler divergence, and the mean absolute, max absolute, mean squared, and root mean squared deviations. It seems to me that those may be better served in a package dedicated to distance metrics such as this one, particularly since other functions in StatsBase don't appear to rely on them. Thoughts?

pairwise to work on rows instead of columns.

Say I have a large matrix X in memory.
I would like to use pairwise operating on the rows instead of the columns, without having to pass the transpose of X (so as not to allocate another object in memory).
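
Until something like this exists in the package, a hand-written loop over rows avoids the transposed copy. A sketch with a hypothetical `pairwise_rows` helper and the Euclidean distance hard-coded:

```julia
# Hypothetical helper: Euclidean distances between all pairs of rows of X,
# indexing X directly so that no transposed copy is allocated.
function pairwise_rows(X::AbstractMatrix)
    n = size(X, 1)
    R = zeros(float(eltype(X)), n, n)
    for j in 1:n, i in (j + 1):n
        s = zero(eltype(R))
        for k in axes(X, 2)
            d = X[i, k] - X[j, k]
            s += d * d
        end
        R[i, j] = R[j, i] = sqrt(s)
    end
    return R
end

X = [0.0 0.0;
     3.0 4.0]       # rows are the points (0,0) and (3,4)
R = pairwise_rows(X)
```

Row access is strided in Julia's column-major layout, so this is likely slower per element than the column version; it trades speed for memory.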

`CosineDist` fails with Ints

Just ran into this:

julia> evaluate(CosineDist(),[1,0],[1,0])
ERROR: BoundsError
 in indexed_next at ./tuple.jl:35 [inlined]
 in eval_reduce(::Distances.CosineDist, ::Float64, ::Tuple{Int64,Int64,Int64}) at /Users/alex/.julia/v0.5/Distances/src/metrics.jl:140
 in macro expansion at /Users/alex/.julia/v0.5/Distances/src/metrics.jl:74 [inlined]
 in macro expansion at ./simdloop.jl:73 [inlined]
 in evaluate(::Distances.CosineDist, ::Array{Int64,1}, ::Array{Int64,1}) at /Users/alex/.julia/v0.5/Distances/src/metrics.jl:71

It works if the vectors contain floats:

julia> evaluate(CosineDist(),[1,0.],[1,0.])
0.0

compatibility with OffsetArrays

Currently not compatible with OffsetArrays, because of calls to size (possibly among other things):

julia> using Distances, OffsetArrays

julia> x,y = rand(10), rand(10)
([0.122568, 0.558197, 0.890631, 0.622067, 0.549109, 0.285662, 0.143514, 0.987727, 0.971258, 0.287773], [0.429365, 0.433621, 0.287694, 0.452117, 0.872341, 0.442937, 0.0165311, 0.74837, 0.648707, 0.0330926])

julia> corr_dist(x,y)
0.44865252243622267

julia> xoff = OffsetVector(x, -5:4)
OffsetArrays.OffsetArray{Float64,1,Array{Float64,1}} with indices -5:4:
 0.122568
 0.558197
 0.890631
 0.622067
 0.549109
 0.285662
 0.143514
 0.987727
 0.971258
 0.287773

julia> corr_dist(xoff, y)
ERROR: size not supported for arrays with axes (-5:4,); see http://docs.julialang.org/en/latest/devdocs/offset-arrays/
Stacktrace:
 [1] errmsg(::OffsetArrays.OffsetArray{Float64,1,Array{Float64,1}}) at /home/dave/.julia/v0.6/OffsetArrays/src/OffsetArrays.jl:91
 [2] evaluate at /home/dave/.julia/v0.6/Distances/src/metrics.jl:170 [inlined]
 [3] cosine_dist at /home/dave/.julia/v0.6/Distances/src/metrics.jl:257 [inlined]
 [4] evaluate at /home/dave/.julia/v0.6/Distances/src/metrics.jl:261 [inlined]
 [5] corr_dist(::OffsetArrays.OffsetArray{Float64,1,Array{Float64,1}}, ::Array{Float64,1}) at /home/dave/.julia/v0.6/Distances/src/metrics.jl:264

This could probably be fixed pretty easily by the suggestion from the OffsetArrays readme to use internal helper functions _size(A::AbstractArray) = map(length, indices(A)).

RFC: type -> immutable

Would it make sense for the types that have internal fields (like Minkowski) to be immutables to elide unnecessary loads?

ChiSqDist generates NaN

In the case where both vectors have a 0 at the same position, the implementation of the Chi-square distance produces NaN since it divides by 0 at this point. I would suggest changing the implementation from

@inline eval_op(::ChiSqDist, ai, bi) = abs2(ai - bi) / (ai + bi)

to

@inline eval_op(::ChiSqDist, ai, bi) = (ai + bi) > zero(ai + bi) ? abs2(ai - bi) / (ai + bi) : zero(ai + bi)

This would implicitly define the distance between 0 and 0 as 0, which would be consistent with the overall behavior.
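
A standalone illustration of the difference, using hypothetical function names instead of the package's `eval_op` machinery:

```julia
# Naive form: 0/0 at a shared zero position poisons the whole sum with NaN.
chisq_naive(a, b) = sum(abs2(x - y) / (x + y) for (x, y) in zip(a, b))

# Guarded form: a term with x + y == 0 contributes zero instead.
chisq_term(x, y) = (x + y) > 0 ? abs2(x - y) / (x + y) : zero(x + y)
chisq_guarded(a, b) = sum(chisq_term(x, y) for (x, y) in zip(a, b))

a = [0.0, 1.0]
b = [0.0, 3.0]
chisq_naive(a, b)    # NaN
chisq_guarded(a, b)  # 1.0
```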

Evolutionary Distances

Hi,

I'm about to implement computation of evolutionary distances between DNA sequences for Bio.jl: BioJulia/Bio.jl#228 using several common nucleotide substitution models (https://en.wikipedia.org/wiki/Models_of_DNA_evolution), like the dist.dna function in the 'ape' R package does: https://github.com/cran/ape/blob/master/R/DNA.R

Other population-based measures of genetic distance may also be implemented in the future: https://en.wikipedia.org/wiki/Genetic_distance

I could just create a separate API / set of methods for these specific distance calculations in Bio.jl, but if applicable I'd like to extend the framework already laid out by Distances.jl, implementing only the types/methods in Bio.jl needed to extend what currently exists. So I've come to ask for comments and advice on doing this from folks working on JuliaStats / Distances.jl

The Bio.jl types I want to compute evolutionary distances for are BioSequence's and PairwiseAlignment's.

I suspect that BioSequences would work well, since a sequence can be thought of as an array or vector of elements. However, PairwiseAlignments are a more complicated and unique data structure, and the addition of substitution matrices and other models of molecular evolution makes me question whether such an extension is possible, compared with a Bio.jl-specific set of types and methods for distances. The Bio.jl PR for this is: BioJulia/Bio.jl#228 (comment); comments and suggestions are most welcome.

Thanks,
Ben W.

Minkowski with p<1 not a metric

julia> using Distances
julia> d = Minkowski(0.5);
julia> a=[1.0,0.0]; b=[0.0,1.0]; c=[0.0,0.0];
julia> evaluate(d, a, b) - evaluate(d,a,c) - evaluate(d,c,b)
2.0
julia> d isa Metric
true

This violates the triangle inequality. For p<1 we should probably throw in the constructor, or skip the outer root (then L^p for p<1 can be considered a metric vector space, albeit not locally convex). Skipping the outer root means something like

julia> using Distances
julia> import Distances.evaluate
julia> struct M<:Metric end
julia> evaluate(::M, x, y) = sum(t->sqrt(abs(t)) , x-y)
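
The arithmetic behind the counterexample can be checked without the package. Both functions below are written out by hand for illustration:

```julia
minkowski(x, y, p) = sum(abs.(x .- y) .^ p)^(1 / p)   # with the outer root
lp_sum(x, y, p) = sum(abs.(x .- y) .^ p)              # outer root skipped

a, b, c = [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]

# With the root: d(a,b) = (1 + 1)^2 = 4, but d(a,c) + d(c,b) = 1 + 1 = 2.
gap_root = minkowski(a, b, 0.5) - minkowski(a, c, 0.5) - minkowski(c, b, 0.5)  # 2.0

# Without the root: d(a,b) = 2 and d(a,c) + d(c,b) = 2, so the inequality holds here.
gap_sum = lp_sum(a, b, 0.5) - lp_sum(a, c, 0.5) - lp_sum(c, b, 0.5)            # 0.0
```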

How to cite license

I'm reusing ideas from Distances' type system and evaluate function to make norms, i.e., essentially distances of one argument instead of two.
As such, I think it's appropriate that I cite Distances' license somewhere, but I'm unsure of the location.

  • Should it be in the LICENSE.md of my package?
  • Mentioned somewhere in the README of my package?
  • Be included in the file where my evaluate method is?

Memory leak?

Hi,

I'm computing pairwise distances for a large matrix (500 x 100,000), and the memory footprint keeps growing indefinitely. I am pre-allocating the output matrix, so I don't think that should be happening. In fact, the process eventually gets killed by the kernel (after using 60 GB+ of memory)... I suspect a memory leak. The code I'm running looks something like

using Base.Threads
using Distances
using JLD


function main(data)

    nvectors = size(data,2)
    js = Matrix{Float64}(nvectors, nvectors)

    pairwise!(js, JSDivergence(), data, data)

    println("Done computing distances")

    writedlm("./mallet_composition_500_JS.txt", js)
end


@time const data = jldopen("./mallet_composition_500.jld", "r") do file
    read(file, "data")
end
data_small = data[:, 1:100]
@time main(data_small)
@time main(data)

Any insights? I haven't profiled for memory leaks in Julia before, but I'll try to see if I can help with more specifics.

Thanks!

Adding Bregman Divergences (Y/N)

A divergence between vectors I've used sometimes in learning contexts is the Bregman divergence. The idea is you take a convex C^1 function and two distributions, and do a kind of Taylor expansion of the function at p, evaluated at q.

This generalizes a few other divergences. For example, you get the generalized KL divergence by using entropy as your function, the Euclidean distance by using the norm squared, etc. An example of an application is this set of OCW notes, which defines the divergence and gives an application to the Mirror Descent algorithm.

If this seems worth adding, I'll submit a PR sometime over the weekend or something.
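
To make the definition concrete, here is a minimal sketch under one common convention, D_F(p, q) = F(p) - F(q) - ⟨∇F(q), p - q⟩. The function names are hypothetical and the gradient is supplied by the caller rather than derived automatically.

```julia
using LinearAlgebra

# Hypothetical helper: Bregman divergence of F with user-supplied gradient gradF.
bregman(F, gradF, p, q) = F(p) - F(q) - dot(gradF(q), p - q)

# With F = squared norm, the divergence reduces to the squared Euclidean distance.
sqnorm(x) = sum(abs2, x)
grad_sqnorm(x) = 2 .* x

p, q = [1.0, 2.0], [1.0, 0.0]
bregman(sqnorm, grad_sqnorm, p, q)  # 4.0, equal to sum(abs2, p - q)
```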

Temporal Distance?

The idea would be to include a temporal distance for Date objects which returns the distance based on various methods. Here is a sketch of the idea.

using Dates
abstract type TemporalArithmetic end
struct CalendricalDistance <: TemporalArithmetic end
struct TemporalDistance <: TemporalArithmetic end
struct PeriodDistance <: TemporalArithmetic
    valid::Function
end
evaluate(dist::Temporal, x::Date, y::Date) = evaluate(dist.method, x, y)
function evaluate(dist::CalendricalDistance, x::Date, y::Date)
    output = yearmonthday(max(x,y)) .- yearmonthday(min(x,y))
    output = Dates.canonicalize(Year(output[1]) + Month(output[2]) + Day(output[3]))
end
evaluate(dist::TemporalDistance, x::Date, y::Date) = abs(x - y)
evaluate(dist::PeriodDistance, x::Date, y::Date) = sum(dist.valid.(min(x,y) + Day(1):max(x,y)))

where a valid PeriodDistance can be constructed for period arithmetic à la

periodselector(obj::Date) = Dates.dayofweek(obj) == Dates.Monday
evaluate(Temporal(periodselector), x, y)
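
A reduced version of the sketch that runs as written on recent Julia versions might look as follows. It keeps only two of the proposed types, constructs `PeriodDistance` directly (the `Temporal` wrapper above is not defined in the sketch), and parameterises the predicate field; all of that is an assumption about the intent.

```julia
using Dates

abstract type TemporalArithmetic end
struct TemporalDistance <: TemporalArithmetic end
struct PeriodDistance{F} <: TemporalArithmetic
    valid::F   # predicate Date -> Bool selecting the days that count
end

# Plain elapsed time between the two dates, as a Day period.
evaluate(::TemporalDistance, x::Date, y::Date) = abs(x - y)

# Number of selected days after the earlier date, up to and including the later one.
evaluate(d::PeriodDistance, x::Date, y::Date) =
    count(d.valid, (min(x, y) + Day(1)):Day(1):max(x, y))

is_monday(d::Date) = dayofweek(d) == Dates.Monday

x, y = Date(2024, 1, 1), Date(2024, 1, 15)
evaluate(TemporalDistance(), x, y)           # Day(14)
evaluate(PeriodDistance(is_monday), x, y)    # 2 (January 8 and 15)
```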

UndefVarError: Haversine not defined

Not sure if I am doing something wrong?

x₁ = rand(2)
x₂ = rand(2)
x₃ = rand(2)

evaluate(Haversine(6371.0), x₁, x₂, x₃)

or

evaluate(Haversine(6371.0), x₁, x₂)

gives

UndefVarError: Haversine not defined

Stacktrace:
 [1] include_string(::String, ::String) at ./loading.jl:522

on both my local installation as well as juliabox but

 evaluate(Euclidean(), x₁, x₂)

works. Not sure what's going on. Ideas appreciated.

Making distances callable

It would be nice if distances were callable, i.e. (d::Metric)(x,y)=evaluate(d, x, y). Unfortunately this line does not fly, due to JuliaLang/julia#14919.

One could do evaluate(d::Metric, x, y)=d(x,y) and add the requisite definitions to the concrete metrics (that way, metric-space code using the old API would be compatible with both new and old API, and new-style distances would be compatible with both new-style and old-style distances).

The reason is that this would allow Distances.jl to be compatible with the "natural API" (the API everyone working with distances reinvents, where a distance is a function), without requiring Distances.jl as a dependency. I am pretty unsure whether Metric<:Function would be sensible.
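
A sketch of the per-type workaround, using stand-in types so it runs without the package (`MyMetric` and `MyEuclidean` are hypothetical). The call method is attached to the concrete type, which sidesteps the limitation on abstract types mentioned above.

```julia
# Stand-in types for illustration; not the package's own hierarchy.
abstract type MyMetric end
struct MyEuclidean <: MyMetric end

evaluate(::MyEuclidean, x, y) = sqrt(sum(abs2, x .- y))

# Make instances of the concrete type callable, forwarding to evaluate.
(d::MyEuclidean)(x, y) = evaluate(d, x, y)

d = MyEuclidean()
d([0.0, 0.0], [3.0, 4.0])  # 5.0, same as evaluate(d, ...)
```

Code written against the "distance is a function" convention can then accept `d` directly.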

Add `@simd` to inner loops

In my local copy I added @simd to the loop in sumsqdiff and it made a noticeable difference. Would be good to inspect other loops and see if @simd is applicable.
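
For illustration, a loop of that shape with the annotation added might look like this. It is a sketch and not the package's actual `sumsqdiff`; the signature is an assumption.

```julia
# Sketch of a sum-of-squared-differences loop annotated with @simd.
# @simd allows the compiler to reorder the floating-point accumulation.
function sumsqdiff(a::AbstractVector{T}, b::AbstractVector{T}) where {T<:AbstractFloat}
    length(a) == length(b) || throw(DimensionMismatch("lengths differ"))
    s = zero(T)
    @inbounds @simd for i in eachindex(a, b)
        d = a[i] - b[i]
        s += d * d
    end
    return s
end

sumsqdiff([1.0, 2.0, 3.0], [1.0, 0.0, 0.0])  # 13.0
```

Because the accumulation order may change, results can differ from a strictly sequential sum in the last bits.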

Slowdown in bench_pairwise and bench_colwise for Julia v0.6

Running the benchmark tools in test/ shows that the basic looping code (not the colwise and pairwise code) is much slower (3-4x) in Julia 0.6 than it was in 0.5... does anyone know what's going on?

v0.5

distance     loop       colwise    gain
SqEuclidean  0.007357s  0.001989s  3.6983
Euclidean    0.007447s  0.002041s  3.6488
Cityblock    0.007344s  0.001986s  3.6977
...

v0.6

distance     loop       colwise    gain
SqEuclidean  0.027896s  0.001776s  15.7099
Euclidean    0.029269s  0.001811s  16.1622
Cityblock    0.029836s  0.001762s  16.9368
...

Doc macro makes compilation fail

I'm trying to compile some packages (using ApplicationBuilder/PackageCompiler) and this line makes the process fail:

https://github.com/JuliaStats/Distances.jl/blob/master/src/metrics.jl#L328

ERROR: LoadError: LoadError: LoadError: LoadError: LoadError: LoadError: LoadError: LoadError: LoadError: LoadError: UndefRefError: access to undefined reference
Stacktrace:
 [1] top-level scope at none:0 (repeats 4 times)
in expression starting at /Users/bieler/.julia/packages/Distances/nLAdT/src/metrics.jl:328

Is there another way to do this doc merging?

Array lengths should be checked.

It looks like some of the routines should call get_common_len instead of length, otherwise a crash or silent corruption might ensue. Example:

julia> using Distances

julia> cosine_dist(zeros(1000000),zeros(1))

signal (11): Segmentation fault
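
A length check of the kind suggested could be as small as the following. `common_length` is a hypothetical stand-in written for this example; the package's own `get_common_len` may differ in detail.

```julia
# Hypothetical stand-in for a shared length check: return the common length
# or throw before any unchecked indexing can happen.
function common_length(a::AbstractArray, b::AbstractArray)
    n = length(a)
    length(b) == n || throw(DimensionMismatch(
        "first array has length $n which does not match the length of the second, $(length(b))."))
    return n
end

common_length(zeros(3), zeros(3))   # 3

mismatch = try
    common_length(zeros(1_000_000), zeros(1))
    false
catch err
    err isa DimensionMismatch
end
```

Calling it before any `@inbounds` loop turns the segmentation fault into an ordinary exception.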

Document use of colwise() and pairwise() functions for metrics that involve more arguments than just two vectors

There is no mention of how to use the pairwise() and colwise() family of functions with metrics whose computation involves more arguments than just two vectors, such as Mahalanobis, which also requires a covariance matrix.

I suppose pairwise() and colwise() family of functions do somehow support these metrics, since they are reported under the benchmark sections.

Could you specify how to use these functions with such metrics?
For example, when using pairwise() with the Mahalanobis metric, should covariance matrices be passed as a vector of matrices (Vector{Matrix{T}})?

Colwise distance between matrix and vector does not work when both are integer vectors

The following throws the error below.

A = [1 2; 3 4]
v = [0 1]
colwise(Euclidean(), A, v)

ERROR: DimensionMismatch("first array has length 2 which does not match the length of the second, 1.")
in evaluate(::Distances.Euclidean, ::SubArray{Int64,1,Array{Int64,2},Tuple{Colon,Int64},true}, ::SubArray{Int64,1,Array{Int64,2},Tuple{Colon,Int64},true}) at /Users/tdefreitas/.julia/v0.5/Distances/src/metrics.jl:64
in colwise!(::Array{Float64,1}, ::Distances.Euclidean, ::Array{Int64,2}, ::Array{Int64,2}) at /Users/tdefreitas/.julia/v0.5/Distances/src/generic.jl:54
in colwise(::Distances.Euclidean, ::Array{Int64,2}, ::Array{Int64,2}) at /Users/tdefreitas/.julia/v0.5/Distances/src/generic.jl:66

I also tried colwise(Euclidean(), A, v') and I get a similar error.

ERROR: DimensionMismatch("The number of columns in a and b must match.")
in get_common_ncols(::Array{Int64,2}, ::Array{Int64,2}) at /Users/tdefreitas/.julia/v0.5/Distances/src/common.jl:11
in colwise(::Distances.Euclidean, ::Array{Int64,2}, ::Array{Int64,2}) at /Users/tdefreitas/.julia/v0.5/Distances/src/generic.jl:64

I get the same results with other distance metrics, and when I swap A and v. Distances.jl seems to not play well with Int64 matrices in general. I'm all for writing a fix, if that's desired.
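
One thing worth checking in the example above: in Julia, `[0 1]` (space-separated) builds a 1×2 matrix, not a vector, which is consistent with the `Array{Int64,2}` types in both stack traces. A 2-element vector is written `[0, 1]`. The sketch below shows the difference and a hand-written column-wise distance against a vector; whether `colwise` itself accepts a matrix and a vector depends on the package version.

```julia
size([0 1])    # (1, 2): spaces build a matrix with one row and two columns
size([0, 1])   # (2,):   commas build a 2-element vector

A = [1 2; 3 4]
v = [0, 1]

# Hand-written equivalent: Euclidean distance from each column of A to v.
dists = [sqrt(sum(abs2, col .- v)) for col in eachcol(A)]
```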

Test failure in the logs

I saw the following test failures in the logs. I can't reproduce it, but since this package uses random numbers, I figured it may still be real. @andreasnoack

ERROR: LoadError: LoadError: Some tests did not pass: 339 passed, 3 failed, 0 errored, 0 broken.
in expression starting at /root/.julia/packages/Distances/Ge9SA/test/test_dists.jl:40
in expression starting at /root/.julia/packages/Distances/Ge9SA/test/runtests.jl:9
Test metricity of RogersTanimoto: Test Failed at /root/.julia/packages/Distances/Ge9SA/test/test_dists.jl:14
  Expression: evaluate(dist, y, y) + one(eltype(y)) ≈ one(eltype(y))
   Evaluated: NaN ≈ true
Stacktrace:
 [1] test_metricity(::RogersTanimoto, ::Array{Bool,1}, ::Array{Bool,1}, ::Array{Bool,1}) at /root/.julia/packages/Distances/Ge9SA/test/test_dists.jl:14
 [2] macro expansion at /root/.julia/packages/Distances/Ge9SA/test/test_dists.jl:81 [inlined]
 [3] top-level scope at /workspace/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1156 [inlined]
 [4] top-level scope at ./none:0
 [5] include at ./boot.jl:317 [inlined]
 [6] include_relative(::Module, ::String) at ./loading.jl:1038
 [7] include(::Module, ::String) at ./sysimg.jl:29
 [8] include(::String) at ./client.jl:388
 [9] top-level scope at none:0
 [10] include at ./boot.jl:317 [inlined]
 [11] include_relative(::Module, ::String) at ./loading.jl:1038
 [12] include(::Module, ::String) at ./sysimg.jl:29
 [13] include(::String) at ./client.jl:388
 [14] top-level scope at none:0
 [15] eval(::Module, ::Any) at ./boot.jl:319
 [16] macro expansion at ./logging.jl:317 [inlined]
 [17] exec_options(::Base.JLOptions) at ./client.jl:219
 [18] _start() at ./client.jl:421
Test metricity of BrayCurtis: Test Failed at /root/.julia/packages/Distances/Ge9SA/test/test_dists.jl:14
  Expression: evaluate(dist, y, y) + one(eltype(y)) ≈ one(eltype(y))
   Evaluated: NaN ≈ true
Stacktrace:
 [1] test_metricity(::BrayCurtis, ::Array{Bool,1}, ::Array{Bool,1}, ::Array{Bool,1}) at /root/.julia/packages/Distances/Ge9SA/test/test_dists.jl:14
 [2] macro expansion at /root/.julia/packages/Distances/Ge9SA/test/test_dists.jl:82 [inlined]
 [3] top-level scope at /workspace/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1156 [inlined]
 [4] top-level scope at ./none:0
 [5] include at ./boot.jl:317 [inlined]
 [6] include_relative(::Module, ::String) at ./loading.jl:1038
 [7] include(::Module, ::String) at ./sysimg.jl:29
 [8] include(::String) at ./client.jl:388
 [9] top-level scope at none:0
 [10] include at ./boot.jl:317 [inlined]
 [11] include_relative(::Module, ::String) at ./loading.jl:1038
 [12] include(::Module, ::String) at ./sysimg.jl:29
 [13] include(::String) at ./client.jl:388
 [14] top-level scope at none:0
 [15] eval(::Module, ::Any) at ./boot.jl:319
 [16] macro expansion at ./logging.jl:317 [inlined]
 [17] exec_options(::Base.JLOptions) at ./client.jl:219
 [18] _start() at ./client.jl:421
Test metricity of Jaccard: Test Failed at /root/.julia/packages/Distances/Ge9SA/test/test_dists.jl:14
  Expression: evaluate(dist, y, y) + one(eltype(y)) ≈ one(eltype(y))
   Evaluated: NaN ≈ true
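
An unconfirmed guess at how a NaN could arise in the Jaccard case: if the random Bool vector happens to be all false, a ratio-based distance divides zero by zero. The function below is a hand-written illustration of the usual Jaccard formula, not the package's implementation, and it does not by itself explain the RogersTanimoto failure.

```julia
# Hand-written Jaccard distance, for illustration only (not the package's code):
# 1 - sum(min(x, y)) / sum(max(x, y))
jaccard_sketch(x, y) = 1 - sum(min.(x, y)) / sum(max.(x, y))

# A Bool vector that happened to be all false.
y_zero = falses(5)
r_zero = jaccard_sketch(y_zero, y_zero)   # 0 / 0 in floating point

# A Bool vector with at least one true entry.
y_some = [true, false, true, false, false]
r_some = jaccard_sketch(y_some, y_some)   # 1 - 2/2
```

Since NaN + 1 is still NaN and NaN is never approximately equal to anything, the check in test_dists.jl:14 would fail for such an input.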
