Comments (13)
The package right now is heavily focused towards computing things on a set of geometric points but as you say, you can compute distances between other objects.
And yes, the correct input should just be a list of things where distance(thing1, thing2)
is defined. It might be worth making something like:
immutable Point{N, T}
coordinates::NTuple{N, T}
end
One can then do:
julia> p = rand(3, 5)
3x5 Array{Float64,2}:
0.310207 0.783898 0.780439 0.607608 0.0795108
0.221908 0.53741 0.553835 0.373283 0.522569
0.291331 0.210073 0.0570539 0.809761 0.397538
julia> reinterpret(Point{3, Float64}, p, (5,))
5-element Array{Point{3,Float64},1}:
Point{3,Float64}((0.31020720319673756,0.22190808492334102,0.2913307988933631))
Point{3,Float64}((0.783898246452825,0.5374103691511933,0.21007290525123268))
Point{3,Float64}((0.7804389420189257,0.5538345595214069,0.0570539417966307))
Point{3,Float64}((0.6076080127056103,0.3732832928705745,0.8097605299871604))
Point{3,Float64}((0.07951078253622623,0.5225687936560375,0.3975381159479192))
To quickly convert between a matrix to a vector of points. This is not type stable but the cost is only paid once by making sure to use a function barrier.
from nearestneighbors.jl.
Thanks for all these answers. Would you be willing to pay a performance penalty for those cases where the data could in fact be presented in an (d,n)
array (probably the overwhelming majority of cases), or are you thinking allowing both interfaces would be the way to go?
from nearestneighbors.jl.
I wouldn't like keeping both the interfaces, would be too much code duplication and awkward to test everything. I would just convert a Matrix{T} -> Vector{Point{N, T}}
the first thing I do with a function barrier. For a sufficient number of input points (not that large) the cost should be insignificant. It can just be documented as well so that if the number of points is low the user has to create the Vector{Point{N,T}}
him/herself.
This issue is actually a bit related: #17
from nearestneighbors.jl.
Ok. I would be a little worried that a performance penalty would show up beyond the initial reinterpret, conceptually whenever slice(data,:,i)
would now be data[i]
, but you certainly know better than I do.
from nearestneighbors.jl.
data[i]
is cheap and does not cause any heap allocation like slicing into an array. In fact it would be faster than slice
because slice
points to a heap allocated object and must thus also be allocated on the heap.
As an example:
immutable Point{N, T} <: AbstractVector{T}
coordinates::NTuple{N, T}
end
Base.length{N}(::Point{N}) = N
Base.size{N}(::Point{N}) = (N,)
Base.linearindexing(::Type{Point}) = Base.LinearFast()
function Base.getindex(p::Point, i::Int)
@inbounds v = p.coordinates[i]
return v
end
function compare{N, T}(A1::Matrix{T}, A2::Matrix{T},
V1::Vector{Point{N, T}}, V2::Vector{Point{N, T}})
m = Euclidean()
@time begin
for i in 1:length(V1)
evaluate(m, V1[i], V2[i])
end
end
@time begin
for i in 1:size(A1, 2)
evaluate(m, slice(A1, :, i), slice(A2, :, i))
end
end
end
A1 = rand(3, 10^6)
A2 = rand(3, 10^6)
V1 = reinterpret(Point{3, Float64}, A1, (10^6,))
V2 = reinterpret(Point{3, Float64}, A2, (10^6,))
compare(A1, A2, V1, V2)
gives:
julia> compare(A1, A2, V1, V2)
0.012314 seconds
0.070496 seconds (4.00 M allocations: 152.588 MB, 26.75% gc time)
so using Points
is about 7 times faster than slicing.
from nearestneighbors.jl.
Very cool & illuminating! Thanks.
from nearestneighbors.jl.
Hi Kristoffer,
In my understanding create_bsphere
in hyperspheres.jl
assumes the sample space is something like a real vector space to find a center by point by taking an average of points. Of course you can't do that for abstract sample spaces (I guess you can't ever step out of the set of sample points in the abstract). Not sure if this is an implementation detail or a hard limit of the algorithm.
from nearestneighbors.jl.
Ball trees should work with arbitrary objects which has a defined metric. However, I am no expert no these stuff and I don't know how to compute the bounding hyper sphere nor how to split the points into sub spheres. Right now I am splitting them in the same way as for the KD trees where you sort the objects along a certain dimension but for general objects you don't even have he concept of a dimension.
from nearestneighbors.jl.
I see, thanks. I guess a refactor of the package towards more generic data would not be super useful without an abstract tree algorithm for metric spaces. Although, in order to separate the workload, I think the interface for generic data could be implemented with a fallback on BruteTree for abstract metric spaces with minimal work, as a start.
Assuming abstract metric space tree algorithms would suffer from a performance penalty (probable), would you be willing to pay that price for BallTrees, or would you prefer two implementations living side by side?
from nearestneighbors.jl.
I believe we could use multiple dispatch to overload things in such a way that we don't need two completely separate implementations. The knn
and inrange
functions probably don't need to be touched, only how the tree is built.
from nearestneighbors.jl.
A couple of things I found out after digging for a little time.
Unfortunately it looks like what we want for abstract metric spaces are "VP trees" and that the algorithm for traversing VP trees is not the same as the one for ball trees. Here is a description:
http://stevehanov.ca/blog/index.php?id=130
As a much quicker hack, I think dispatching a center()
function in create_bsphere
to either a vector space average (when applicable) or a naive random draw among all points will produce a valid-if-not-that-great search tree.
from nearestneighbors.jl.
This is better now and you can have a Vector of Widgets
. However, each Widget
need to be an AbstractVector
and have a defined eltype(Type{Widget})
and length(Type{Widget})
defined so perhaps not as general as you would like.
from nearestneighbors.jl.
Thanks Kristoffer, this looks great as always.
Implementing all the changes we mentioned in this thread proved a bit overwhelming for me, especially since I don't have your command of efficient generic programming. I ended up putting together a basic VP Tree algorithm for my own use. I just pushed some comments to it here.
The data is given as data::Vector{T}
. x::T
is a point in an abstract metric space: the only assumption is that we can evaluate(distance,x::T,y::T)
(x+y
is not necessarily defined, etc.).
from nearestneighbors.jl.
Related Issues (20)
- Querying a vector of NaN occasionally results in invalid indices HOT 5
- Bug: BoundsError when skipping all entries HOT 1
- periodic kdtree or balltree HOT 5
- Benchmarking parallel implementation leads to segfaults HOT 2
- No constructor for NNTree HOT 2
- Nearest Neighbors in a very large dataset. HOT 1
- mutual nearest neighbors HOT 3
- [FR] Classifier function
- BallTree and KDTree do not accept Vector{AbstractVector{Float64}} Type. Possible bug. HOT 1
- knn: skipped items output when there is a skip function has always the last index and not 0 index HOT 3
- `nn` lacking docstring HOT 1
- AssertionError in kmedoids alg
- bug: BoundsError when the skip function returns true for all points HOT 3
- Is there a reason sqeuclidean distance is not supported? HOT 4
- README.md Misleading Custom Metric Documentation
- Document that `inrangecount` also counts the point itself HOT 2
- [Question] Can you insert new data into an existing KDTree object? HOT 2
- Compilation time issues with very high dimensions HOT 3
- Reverse Cuthill-McKee ordering option HOT 1
- Querying number of distance evaluations HOT 2
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from nearestneighbors.jl.