guokr / simbase

A vector similarity database

License: Apache License 2.0

Shell 0.38% Clojure 0.75% Java 98.87%

simbase's Introduction

Simbase: A vector similarity database

Simbase is a Redis-like vector similarity database. You can add, get, and delete vectors, and then retrieve the most similar vectors within one vector set or between two vector sets.

Release

Current version is v0.1.0-beta1.

Concepts

Simbase uses the following conceptual model:

                   + - - - +
      +----------->| Basis |<------------------+
      |  belongs   + _ _ _ +      belongs      |
      |                                        |
      |                                        |
+ - - - - - +        source           + - - - - - - - -+ 
| VectorSet |<------------------------| Recommendation |
+ - - - - - +                         + - - - - - - - -+
      ^              target                    |
      |________________________________________|

Vector set: a set of vectors.
Basis: the basis for vectors; all vectors in one vector set share the same basis.
Recommendation: a one-directional binary relationship between two vector sets that have the same basis.

A real-world example that follows this model:

     + - - - - - +                 + - - - - - - - -+ 
+--->|  Articles |<----------------|  User Profiles |
|    + - - - - - +                 + - - - - - - - -+
|          |
+----------+

This graph shows two recommendations:

recommend articles by article (from article to article)
recommend articles by user profile (from user profile to article)
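The three concepts can be pictured as a small in-memory model. This is only an illustrative sketch, not Simbase's internal data structure: the class names mirror the concepts above, while the field names and the `b3` basis are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Basis:
    """A named, ordered list of component names shared by vector sets."""
    name: str
    components: tuple


@dataclass
class VectorSet:
    """A set of vectors keyed by id, all expressed in one basis."""
    name: str
    basis: Basis
    vectors: dict = field(default_factory=dict)


@dataclass
class Recommendation:
    """A one-directional relationship from a source set to a target set."""
    source: VectorSet
    target: VectorSet

    def __post_init__(self):
        # Both vector sets must belong to the same basis.
        if self.source.basis != self.target.basis:
            raise ValueError("source and target must share a basis")


basis = Basis("b3", ("t1", "t2", "t3"))
articles = VectorSet("article", basis)
profiles = VectorSet("userprofile", basis)

# The two relationships from the example graph:
# article -> article and user profile -> article.
by_article = Recommendation(source=articles, target=articles)
by_profile = Recommendation(source=profiles, target=articles)
```

A recommendation whose source and target are the same vector set corresponds to the self-loop on Articles in the graph.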

How to build and start

To build the project, install Leiningen first, and then run:

cd SIMBASE_HOME

lein uberjar

After the uberjar is created, you can start the system:

cd SIMBASE_HOME

bin/start

How to connect to Simbase

You can use redis-cli directly for administration tasks.

Alternatively, you can use the Redis client bindings available for various languages to access Simbase programmatically.

Python example

import redis

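# Simbase speaks the Redis protocol; this example connects on port 7654.
dest = redis.Redis(host='localhost', port=7654)
# Create a basis 'ba' with components a, b, c.
dest.execute_command('bmk', 'ba', 'a', 'b', 'c')
# Create a vector set 'va' on basis 'ba'.
dest.execute_command('vmk', 'ba', 'va')
# Create a recommendation from 'va' to itself, scored with cosinesq.
dest.execute_command('rmk', 'va', 'va', 'cosinesq')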

Node.js example

var redis = require("redis"), client = redis.createClient(7654, 'localhost');

client.send_command('bmk', ['ba', 'a', 'b', 'c'])
client.send_command('vmk', ['ba', 'va'])
client.send_command('rmk', ['va', 'va', 'cosinesq'])

A general application case

For example, to recommend articles to users, we can follow the steps below:

Setup

> bmk b2048 t1 t2 t3 ... t2047 t2048
> vmk b2048 article
> vmk b2048 userprofile
> rmk userprofile article cosinesq

Fill data

> vadd article 1 0.11 0.112 0.1123...
> vadd article 2 0.21 0.212 0.2123...
...    

> vadd userprofile 1 0.11 0.112 0.1123...
> vadd userprofile 2 0.21 0.212 0.2123...
...

Query

> rrec userprofile 2 article

All commands are explained in the next section.
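The same workflow can be scripted from a client. The sketch below builds the commands as argument tuples and sends them through any object that offers a redis-py style `execute_command` method. The helper names and the `RecordingClient` stand-in are hypothetical and exist only so the example runs without a server; the "OK" reply is a placeholder, not Simbase's documented response.

```python
def setup_commands(basis, components, source, target, score="cosinesq"):
    """Build the setup commands as argument tuples."""
    return [
        ("bmk", basis, *components),
        ("vmk", basis, target),
        ("vmk", basis, source),
        ("rmk", source, target, score),
    ]


def vadd_command(vecset, vecid, vector):
    """Build a vadd command for one vector."""
    return ("vadd", vecset, vecid, *vector)


def run(client, commands):
    """Send each command through the client's execute_command method."""
    return [client.execute_command(*cmd) for cmd in commands]


class RecordingClient:
    """Hypothetical stand-in that records commands instead of sending them."""

    def __init__(self):
        self.sent = []

    def execute_command(self, *args):
        self.sent.append(args)
        return "OK"  # placeholder reply, not a real Simbase response


client = RecordingClient()
commands = setup_commands("b3", ["t1", "t2", "t3"], "userprofile", "article")
commands.append(vadd_command("article", 1, [0.11, 0.112, 0.1123]))
commands.append(vadd_command("userprofile", 2, [0.21, 0.212, 0.2123]))
commands.append(("rrec", "userprofile", 2, "article"))
run(client, commands)
```

Against a running server, the `redis.Redis(host='localhost', port=7654)` object from the Python example could be passed to `run` in place of `RecordingClient`.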

Core commands

You can use redis-cli to connect to Simbase directly and issue the commands below.

Basis related

blist

blist

List all bases in the system.

bmk basisname components...

bmk b512 universe time space human animal plant...

Create a basis.

brev basisname components...

brev b512 plant animal human space time universe...

Revise a basis.

Vector set related

vlist basisname

vlist b512

List all vector sets that use the given basis.

vmk basisname vecsetname

vmk b512 article

Create a vector set.

vget vecsetname vecid

vget article 12345678

Get the vector for the article with id 12345678.

vadd vecsetname vecid components...

vadd article 12345678 0.1 0.12 0.123 0.1234 0.12345 0.123456...

Add the vector for the article with id 12345678.

vset vecsetname vecid components...

vset article 12345678 0.1 0.12 0.123 0.1234 0.12345 0.123456...

Set the value of the article vector with id 12345678.

vacc vecsetname vecid components...

vacc article 12345678 0.1 0.12 0.123 0.1234 0.12345 0.123456...

Accumulate the value for the article vector with id 12345678.

vrem vecsetname vecid

vrem article 12345678

Remove the vector with id 12345678 from the article vector set.

Recommendation related

rlist vecsetname

rlist article

List all recommendation targets that have the given vector set as their source.

rmk vecsetname1 vecsetname2 funcscore

rmk userprofile article cosinesq

Create a recommendation from userprofile to article that uses cosinesq as its score function. The score functions currently available are 'cosinesq' and 'jensenshannon'.

rrec vecsetname1 vecid vecsetname2

rrec userprofile 87654321 article

Recommend articles for user 87654321.

Limitations

Assumptions on vectors

Although Simbase can store arbitrary vectors, score functions may impose constraints on them.

For example, if you adopt "jensenshannon" as your score function, you should ensure that each vector is a probability distribution, i.e. its components are non-negative and sum to one.
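One way to satisfy this constraint is to normalize vectors on the client before sending them. The sketch below is an assumption about how a client might prepare its data, not something Simbase does for you; the function names are hypothetical.

```python
import math


def to_distribution(vector):
    """Scale a non-negative vector so that its components sum to one."""
    if any(x < 0 for x in vector):
        raise ValueError("components must be non-negative")
    total = sum(vector)
    if total == 0:
        raise ValueError("vector must have a positive sum")
    return [x / total for x in vector]


def is_distribution(vector, tol=1e-9):
    """Check that components are non-negative and sum to one."""
    return all(x >= 0 for x in vector) and math.isclose(
        sum(vector), 1.0, abs_tol=tol
    )


p = to_distribution([2.0, 1.0, 1.0])
```

Raw counts such as term frequencies can be passed through `to_distribution` before a `vadd`, and `is_distribution` can guard against sending unnormalized data.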

Performance consideration

Write operations are handled by a single thread per basis, and each write requires comparing the written vector against the other vectors, so the cost of a write scales as O(n).

We ran a preliminary performance test with dense vectors on an i7 MacBook: Simbase easily handles 100k 1k-dimensional vectors with each write operation taking under 0.14 sec. If the linear scaling holds, Simbase can handle about 700k dense vectors with each write operation taking under 1 sec.
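The 700k figure follows from simple proportionality. The arithmetic below assumes that write latency stays strictly linear in the number of vectors, which the test above does not establish.

```python
measured_vectors = 100_000   # vectors in the reported test
measured_latency = 0.14      # seconds per write at that size
budget = 1.0                 # target seconds per write

# Under O(n) scaling, latency is proportional to the number of vectors.
per_vector = measured_latency / measured_vectors
max_vectors = int(budget / per_vector)
print(max_vectors)
```

The result is a little over 714k vectors, which the text rounds down to 700k.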

Since the data is all in memory, the read operation is pretty fast.

We are still in the process of tuning the performance of the sparse vectors.

Licenses

Simbase is dual licensed under the Apache License 2.0 and Eclipse Public License 1.0. Simbase is free for commercial use and distribution under the terms of either license.

Special thanks

Special thanks to Feng Sheng; we borrowed lots of code from his great project http-kit ( https://github.com/http-kit/http-kit/ ).

Thanks also to Kunwei Zhang from Tsinghua Univ. for his smart idea.

Contributors

Mingli Yuan ( https://github.com/mountain )
Wanjian Wu ( https://github.com/jseagull )
Yang Zhang ( https://github.com/zmouren )
Jianjiang Zhu ( https://github.com/zjjott )
Jiacai Liu ( https://github.com/jiacai2050 )

simbase's Issues

Enhance the Info command for monitoring

Most Redis monitoring tools use the Info command to get execution status data. We need to implement the same mechanism so that these monitoring tools can be reused.

lein uberjar reports an error

Compiling simbase
Exception in thread "main" java.io.FileNotFoundException: Could not locate simbase__init.class or simbase.clj on classpath:
at clojure.lang.RT.load(RT.java:443)
at clojure.lang.RT.load(RT.java:411)
at clojure.core$load$fn__5018.invoke(core.clj:5530)
at clojure.core$load.doInvoke(core.clj:5529)
at clojure.lang.RestFn.invoke(RestFn.java:408)
at clojure.core$load_one.invoke(core.clj:5336)
at clojure.core$compile$fn__5023.invoke(core.clj:5541)
at clojure.core$compile.invoke(core.clj:5540)
at user$eval9.invoke(form-init5853325299914114337.clj:1)
at clojure.lang.Compiler.eval(Compiler.java:6619)
at clojure.lang.Compiler.eval(Compiler.java:6609)
at clojure.lang.Compiler.load(Compiler.java:7064)
at clojure.lang.Compiler.loadFile(Compiler.java:7020)
at clojure.main$load_script.invoke(main.clj:294)
at clojure.main$init_opt.invoke(main.clj:299)
at clojure.main$initialize.invoke(main.clj:327)
at clojure.main$null_opt.invoke(main.clj:362)
at clojure.main$main.doInvoke(main.clj:440)
at clojure.lang.RestFn.invoke(RestFn.java:421)
at clojure.lang.Var.invoke(Var.java:419)
at clojure.lang.AFn.applyToHelper(AFn.java:163)
at clojure.lang.Var.applyTo(Var.java:532)
at clojure.main.main(main.java:37)
Compilation failed: Subprocess failed

Is "Instant Similarity Query" not supported?

Recommendation queries seem to be allowed only for registered vectors.
I think that when a new vector is registered (vadd command), all comparison calculations are performed (O(n)), and when recommendations are queried, the pre-calculated results are returned (constant time complexity?).
But in my application, users input arbitrary vectors to the system for "instant" recommendation results.
Please give me some guidance.

Cosine Function

Is it possible to print the actual distance between the vectors instead of ranking them? For example, when comparing two vectors, the result would be 0.3 (30% similarity). If not, are you going to implement it?

Support matrix and vector transforming

The API still needs to be discussed.

Why not reuse the memory when setting vectors?

@Override
public void set(int vecid, float[] vector) {
    if (indexer.containsKey(vecid)) {
        float[] old = get(vecid);

        if (lengths.get(vecid) != vector.length) {
            remove(vecid);
            add(vecid, vector);
        } else {
            int cursor = indexer.get(vecid);
            for (float val : vector) {
                data.set(cursor, val);
                cursor++;
            }
        }
        if (listening) {
            for (VectorSetListener l : listeners) {
                l.onVectorSetted(this, vecid, old, vector);
            }
        }
    } else {
        add(vecid, vector);
    }
}

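Once the lengths are found to differ, the element is removed immediately and then a new element is added.
But when the new length is smaller than the old length, that space could still be reused; only the length would need to be updated. Why is it done this way?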

Is Euclidean distance not supported?

I'm very happy to see an open source vector database!
Simbase is great for me, thanks :D

I have a question (or maybe a new feature request).
The supported similarity (score) functions are "cosinesq" and "jensenshannon".
The cosine similarity function does not take vector magnitude into account.
But in my application, vector magnitude is meaningful for similar vector search.
I would like a similarity function using "Euclidean distance" to be supported as well :D
Please give me some guidance, and thanks for your great vector DB :D
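The difference the poster describes can be seen with two vectors that point in the same direction but differ in length. The functions below are textbook definitions written for illustration; they are not Simbase's "cosinesq" implementation, whose exact formula is not given here.

```python
import math


def cosine(u, v):
    """Cosine similarity: depends only on the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean(u, v):
    """Euclidean distance: sensitive to both direction and magnitude."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]  # same direction as u, twice the magnitude

similarity = cosine(u, v)
distance = euclidean(u, v)
```

Here the cosine similarity is at its maximum even though the vectors are clearly apart in Euclidean terms, which is why magnitude-sensitive applications would need a different score function.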

how to run simbase?

I installed simbase (not sure if correctly, but there were no errors). I have an ssh connection to the server where I installed simbase, so I can work directly on the server machine. But I have no root access; I can only use sudo. When I run bin/start or sudo bin/start, it says "command not found". Any ideas?

Help me with my use case plz

Could you please help me with some starter code for my use case?

I want to store key: sentenceID, value: vector in the vector similarity db. Examples:
id_1 [0.06284283101558685, 0.046207964420318604, 0.0053909290581941605, ...]
id_2 [0.006631242576986551, 0.08234132081270218, -0.0787612572312355, ...]

And then I want the IDs of the top n vectors most similar to a given vector.
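A possible starting point, using only the commands documented in the README, is sketched below. It assumes that vector ids must be integers (so the string sentence IDs are mapped to integers on the client), that the query vector is itself stored in the set, and that a recommendation from a vector set to itself is acceptable, as in the README's `rmk va va cosinesq` example. The helper name, the `sentence` set name, and the `c0`, `c1`, ... component names are hypothetical, and the vectors are cut to three components for brevity.

```python
def build_commands(sentences, basis="b3", vecset="sentence"):
    """Map string sentence ids to integer ids and build Simbase commands."""
    dim = len(next(iter(sentences.values())))
    # Assign a stable integer id to each sentence id.
    id_map = {sid: i for i, sid in enumerate(sorted(sentences), start=1)}
    commands = [
        ("bmk", basis, *[f"c{k}" for k in range(dim)]),
        ("vmk", basis, vecset),
        ("rmk", vecset, vecset, "cosinesq"),
    ]
    for sid, vector in sentences.items():
        commands.append(("vadd", vecset, id_map[sid], *vector))
    return id_map, commands


sentences = {
    "id_1": [0.06284283101558685, 0.046207964420318604, 0.0053909290581941605],
    "id_2": [0.006631242576986551, 0.08234132081270218, -0.0787612572312355],
}
id_map, commands = build_commands(sentences)

# Query: vectors in the same set that are recommended for sentence id_1.
query = ("rrec", "sentence", id_map["id_1"], "sentence")
```

Each tuple can be sent with a Redis client's generic command method, and the integer ids in the reply can be translated back through an inverted `id_map`.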

Support clustering of vectors

In personalized recommendation scenarios, the user profile vector set is usually very large, for example ~10m vectors or more. We take 10m vectors as a baseline, because it is still possible to store all 10m vectors on one physical machine.

10m vectors * 2048 dimensions * 4-byte float ≈ 80 GB of memory
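The estimate can be checked directly. This covers only the raw float storage and ignores any index or per-vector overhead, which is not quantified here.

```python
vectors = 10_000_000
dimensions = 2048
bytes_per_float = 4

total_bytes = vectors * dimensions * bytes_per_float
gigabytes = total_bytes / 10**9   # decimal GB
gibibytes = total_bytes / 2**30   # binary GiB

print(f"{gigabytes:.2f} GB, {gibibytes:.2f} GiB")
```

The raw data comes to roughly 82 GB (about 76 GiB), consistent with the rounded figure of 80 GB.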

The current solution does not scale to this level, because write latency would be ~30s, which is not acceptable.

One idea: instead of recommending for a single user, recommend for a cluster of similar users.

Two choices: online KMeans or SimHash?
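To make the second option concrete, here is a minimal random-hyperplane SimHash sketch. It only illustrates the general technique and is not a proposed design for Simbase; the function names, the bit count, and the seed are all assumptions.

```python
import random


def make_hyperplanes(dimensions, bits, seed=0):
    """Draw one random Gaussian hyperplane per signature bit."""
    rng = random.Random(seed)
    return [
        [rng.gauss(0.0, 1.0) for _ in range(dimensions)]
        for _ in range(bits)
    ]


def simhash(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    signature = 0
    for plane in hyperplanes:
        dot = sum(a * b for a, b in zip(vector, plane))
        signature = (signature << 1) | (1 if dot >= 0 else 0)
    return signature


planes = make_hyperplanes(dimensions=4, bits=8)
a = simhash([0.1, 0.2, 0.3, 0.4], planes)
b = simhash([0.2, 0.4, 0.6, 0.8], planes)  # same direction, larger magnitude
```

Vectors with a small angle between them tend to share most signature bits, so user profiles could be grouped into buckets by signature. Online KMeans would instead maintain explicit centroids that are updated as vectors arrive.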

In the Basis class, what is the difference between the get method and the all method?

They look almost the same.

Eliminate command layer by reflection or codegen

The commands can be inferred from the engine. We can introduce some annotations and automatically create the commands by reflection or codegen.

Found a bug in the code in store/Recommendation.java

    if (source != target) {
        scoring.onAttached(target.key());
        target.addListener(scoring);

        if (target.type().equals("dense")) {
            for (int id : source.ids()) {
                scoring.onVectorAdded(target, id, target.get(id));
            }
        } else {
            for (int id : target.ids()) {
                scoring.onVectorAdded(target, id, target._get(id));
            }
        }
    }

for(int id: source.ids()) ==> for(int id: target.ids())

Rewrite the server layer

The server layer is derived from http-kit, and we need to refactor the code to improve its quality.

A more sophisticated test framework

Currently the very thin test framework is neither natural nor robust for asynchronous tests; this makes the tests difficult to read and also leads to failures on Travis CI. We should try to fix this.

Is only the "int" vector id type supported?

I've read some of the simbase code (because you've requested implementing a score function using Euclidean distance, preserving vector magnitude).
I noticed that the vector id type in simbase is the Java int type.
That means the vecid value space is limited to the 32-bit integer space.

How about supporting a "long" or "String" vector id type?
String type => can support any kind of id value, but probably with a much larger memory footprint.
long type => provides a huge integer id value space; uses more memory than the 32-bit int type but much less than the String type.

In some applications (my case :D), the int id type is not enough.
The database key that will be matched to a vector instance (by vecid) in simbase is a 64-bit long.
(I'm using Titan Graph Database (it's awesome 👍), and its graph vertex ids are 64-bit longs.)
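The size of the gap between the two integer types is easy to check on the client side before sending ids. The helper names below are hypothetical; the ranges are those of Java's signed 32-bit int and 64-bit long.

```python
JAVA_INT_MIN, JAVA_INT_MAX = -2**31, 2**31 - 1
JAVA_LONG_MIN, JAVA_LONG_MAX = -2**63, 2**63 - 1


def fits_int(vecid):
    """True if vecid is representable as a Java int."""
    return JAVA_INT_MIN <= vecid <= JAVA_INT_MAX


def fits_long(vecid):
    """True if vecid is representable as a Java long."""
    return JAVA_LONG_MIN <= vecid <= JAVA_LONG_MAX


vertex_id = 2**40  # an example 64-bit id that exceeds the int range
```

Until a wider id type is available, a client would have to keep its own mapping from 64-bit keys to 32-bit vector ids.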
