erdavila / m-tree
A data structure for efficient nearest-neighbor queries.
License: MIT License
An m-tree is a data structure that indexes objects according to their relative distances. It is efficient for nearest-neighbor queries.

This implementation follows the article http://www.vldb.org/conf/1997/P426.PDF, with the following highlights:

* The data structure is the same as the one described in the article.
* The algorithm for adding data was adapted to handle some corner cases.
* The algorithm for removing data was designed from scratch, since the article does not describe one.
* Both query algorithms (by-range and k-nearest-neighbor) were merged into one. The query can be constrained by the range, by the maximum number of resulting items, or by both. The algorithm processes the results on demand, as the resulting items are fetched. The results are returned in non-decreasing order of distance from the query data parameter.

MTree is the core class that implements an m-tree. The data objects can be of any type that the distance function understands. Some examples:

* Data objects could be N-dimensional-space coordinates and the distance function could return the Euclidean distance between two data objects.
* Data objects could be regular strings and the distance function could calculate the edit (Levenshtein) distance between two strings.

The distance function must be provided by the user.

Two equal objects should not be added to the same MTree instance. There is no validation for this and, if it happens, the behavior of the tree is undefined.

Given a data object type and a distance function, some aspects can directly impact the performance of an m-tree: the constraints on the number of children of the nodes and the support functions (split, promotion, and partition functions).

An m-tree is implemented as a tree (really!? :-P) and the minimum and maximum number of children of each node can be customized. When the maximum capacity of a node is exceeded, a split function is used to split the node, which is replaced by two new nodes.
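The two distance-function examples listed earlier (Euclidean distance over coordinates, edit distance over strings) can be sketched in C++ as below. This is only an illustration: the function names `euclidean_distance` and `levenshtein_distance` are hypothetical and are not taken from this project, and the exact signature that MTree expects for a distance function is described in the language-specific README, not here.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: Euclidean distance between two points.
// Assumes both vectors have the same number of coordinates.
double euclidean_distance(const std::vector<double>& a,
                          const std::vector<double>& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = a[i] - b[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}

// Hypothetical helper: edit (Levenshtein) distance between two strings,
// computed with two rolling rows of the dynamic-programming table.
double levenshtein_distance(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) {
        prev[j] = j;  // turning "" into b[0..j) takes j insertions
    }
    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = i;  // turning a[0..i) into "" takes i deletions
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = std::min({prev[j] + 1,          // deletion
                                curr[j - 1] + 1,      // insertion
                                prev[j - 1] + cost}); // substitution or match
        }
        std::swap(prev, curr);
    }
    return static_cast<double>(prev[b.size()]);
}
```

Both functions are metrics (non-negative, symmetric, and satisfying the triangle inequality), which is the property the m-tree article relies on when pruning subtrees during a query.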
The split function can be seen as being composed of a promotion function and a partition function. The promotion function must choose (promote) two children from the node that will be split. The partition function must partition the set of children into two sets corresponding to the promoted children. The two promoted children and their corresponding partitions are used to create the two nodes that replace the overflowing node.

There are some pre-defined implementations of promotion and partition functions, and the user can implement new ones at will.

The yet-to-be-implemented MTreeDict class wraps an MTree and a mapping structure, so it helps keep track of objects already added. The data objects work as keys for associated value objects, so MTreeDict can be used as a dictionary.

The yet-to-be-implemented MTreeMultiDict works like MTreeDict, but allows associating more than one value object with the same key. The number of values associated with the same key is taken into account when performing queries constrained by the number of results.

See the README file specific to each language:

* README-py for Python.
* README-cpp for C++.

For more information see:

* http://en.wikipedia.org/wiki/M-tree
* http://www.vldb.org/conf/1997/P426.PDF
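As a rough sketch of how a split can be composed from a promotion function and a partition function, as described above, consider the following. All names (`promote_farthest_pair`, `partition_by_nearest`, `split_children`) are hypothetical, and the strategies shown (promote the two children that are farthest apart, then assign each child to the closer promoted child) are just one simple choice, not necessarily any of the pre-defined implementations shipped with this project.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical promotion function: returns the indices of the two children
// that are farthest apart. Assumes children.size() >= 2.
template <typename Data, typename Distance>
std::pair<std::size_t, std::size_t> promote_farthest_pair(
        const std::vector<Data>& children, Distance distance) {
    std::pair<std::size_t, std::size_t> best(0, 1);
    double best_distance = -1.0;
    for (std::size_t i = 0; i < children.size(); ++i) {
        for (std::size_t j = i + 1; j < children.size(); ++j) {
            const double d = distance(children[i], children[j]);
            if (d > best_distance) {
                best_distance = d;
                best = std::make_pair(i, j);
            }
        }
    }
    return best;
}

// Hypothetical partition function: each child goes to the set of the closer
// promoted child; ties go to the first set.
template <typename Data, typename Distance>
std::pair<std::vector<Data>, std::vector<Data>> partition_by_nearest(
        const std::vector<Data>& children,
        const Data& first_promoted, const Data& second_promoted,
        Distance distance) {
    std::pair<std::vector<Data>, std::vector<Data>> result;
    for (const Data& child : children) {
        if (distance(child, first_promoted) <= distance(child, second_promoted)) {
            result.first.push_back(child);
        } else {
            result.second.push_back(child);
        }
    }
    return result;
}

// The split is the composition of promotion followed by partition.
template <typename Data, typename Distance>
std::pair<std::vector<Data>, std::vector<Data>> split_children(
        const std::vector<Data>& children, Distance distance) {
    const std::pair<std::size_t, std::size_t> promoted =
        promote_farthest_pair(children, distance);
    return partition_by_nearest(children, children[promoted.first],
                                children[promoted.second], distance);
}
```

Note that this partition makes no attempt to keep the two sets balanced, so on its own it does not guarantee the minimum number of children per node; a balanced partition would be needed for that.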
As far as I can see, splitting the nodes in a bottom-up fashion is implemented by throwing and catching some data. But it is said that throwing and catching should only be used for exceptional situations and errors, not in normal program execution, because the mechanism is very slow and breaks the rules of good programming style.
So why not add to every node a reference to its parent? Is this a way to avoid a stack overflow due to recursion?
Hi,
this M-Tree Java implementation works as it outputs the correct nearest neighbors in the correct order. Thank you for open sourcing this! But when I count the number of distance calculations I always get a larger number than there are items in the M-Tree :/ Am I doing something wrong?
flo
I am trying to build this project with a newer GCC version (11.2.0) and I am getting errors about dynamic exception specifications:
mtree.h:669:85: error: ISO C++17 does not allow dynamic exception specifications
669 | void addData(const Data& data, double distance, const mtree* mtree) throw(SplitNodeReplacement) {
| ^~~~~
mtree.h:723:98: error: ISO C++17 does not allow dynamic exception specifications
723 | virtual void doRemoveData(const Data& data, double distance, const mtree* mtree) throw (DataNotFound) = 0;
...
Removing the throw specifications completely resolves the issue, but is there any other way to fix this problem? I have tried changing the default configuration settings (i.e. "cppStandard": "c++17" to "cppStandard": "c++11"), but I still get the same error.
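One possible approach, sketched here under assumptions: C++17 removed dynamic exception specifications of the form `throw(T)` from the language, so they can only be kept when the compiler itself is invoked with an older standard such as `-std=c++14` (GCC 11 defaults to `-std=gnu++17`). The `"cppStandard"` setting looks like an editor configuration option rather than a compiler flag, which would explain why changing it had no effect. If the annotations should stay in the source, a macro that expands to nothing under C++17 and later is one way to do it. The macro name `MTREE_THROWS` and the function `remove_data` below are hypothetical and are not part of this project.

```cpp
#include <stdexcept>

// Hypothetical macro: keeps the dynamic exception specification for
// pre-C++17 compilers and drops it where the language no longer allows it.
#if __cplusplus >= 201703L
#  define MTREE_THROWS(...)
#else
#  define MTREE_THROWS(...) throw(__VA_ARGS__)
#endif

// Stand-in for the DataNotFound type named in the compiler output above.
struct DataNotFound : std::runtime_error {
    DataNotFound() : std::runtime_error("data not found") {}
};

// Hypothetical function carrying the annotation. Under C++17 the macro
// expands to nothing, so the function is simply allowed to throw.
void remove_data(bool found) MTREE_THROWS(DataNotFound) {
    if (!found) {
        throw DataNotFound();
    }
}
```

Since a function without any exception specification may throw anything, simply deleting `throw(SplitNodeReplacement)` and `throw (DataNotFound)` does not change which exceptions propagate; the macro only preserves the annotation as documentation for older standards.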