jubatus_core's Introduction

jubatus_core

jubatus_core is the core library of Jubatus.

See http://jubat.us/ for details of Jubatus.

How to install

We officially support Red Hat Enterprise Linux (RHEL) 6.2 or later (64-bit) and Ubuntu Server 14.04 LTS / 16.04 LTS (64-bit). On supported systems, you can install all components of Jubatus using binary packages.

If you have installed Jubatus 0.6.0 or later, jubatus_core is already available. QuickStart describes how to install Jubatus.

If you do not want to install all of Jubatus, you can install jubatus_core alone. Before installing, install msgpack and, optionally, oniguruma. Then run the following:

wget -O jubatus_core-master.tar.gz https://github.com/jubatus/jubatus_core/archive/master.tar.gz
tar xf jubatus_core-master.tar.gz
cd jubatus_core-master
./waf configure --prefix=<prefix>
./waf
./waf --checkall
./waf install

If you do not need oniguruma, type

./waf configure --regexp-library=none --prefix=<prefix>

instead of

./waf configure --prefix=<prefix>

If you want to use re2 instead of oniguruma, add --regexp-library=re2 to ./waf configure.

License

LGPL 2.1

Third-party library included in jubatus_core

The Jubatus source tree includes the following third-party libraries.

  • Eigen (mainly under the MPL2 License, while some code is under LGPL2.1 or LGPL2.1+)
  • A fork of pficommon (placed under jubatus_core's jubatus/core/util/. New BSD License)

Update history

The update history can be found in ChangeLog.

Contributors

Contributors are listed at https://github.com/jubatus/jubatus/contributors and https://github.com/jubatus/jubatus_core/contributors.

jubatus_core's People

Contributors

beam2d, crossquare, crow-misia, gintenlabo, gwtnb, hido, hillbig, imaimai1125, jkomiyama, kazuki, kmaehashi, kuenishi, kumagi, luomin, odasatoshi, pepshiso, rimms, sakuraikaito, shiodat, suma, t-abe, tkrudagawa, unnonouno, y-oda-oni-juba, y-tag, yamori813, yukimori

jubatus_core's Issues

Invalid replace string is not checked in re2_filter

We need to check replacement strings in re2_filter. When a user gives an invalid replacement string to re2_filter, it reports no error.

In re2's GlobalReplace method (https://code.google.com/p/re2/source/browse/re2/re2.cc#374), errors raised by the Rewrite method are ignored at https://code.google.com/p/re2/source/browse/re2/re2.cc#399. To check the validity of the replacement string, we need to call CheckRewriteString (https://code.google.com/p/re2/source/browse/re2/re2.cc#907) in the constructor.
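
As a rough illustration of the kind of check being asked for, here is a library-free sketch. It does not call re2; check_rewrite_string and its error messages are hypothetical, and the rules it enforces (a backslash must be followed by a digit or another backslash, and the digit must not exceed the number of capture groups) are assumptions about what a rewrite-string check would cover.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical stand-in for a rewrite-string check (not re2's API).
// Returns false and fills `error` when `rewrite` is malformed or refers to a
// capture group that a pattern with `num_groups` groups cannot provide.
bool check_rewrite_string(const std::string& rewrite, int num_groups,
                          std::string* error) {
  for (std::string::size_type i = 0; i < rewrite.size(); ++i) {
    if (rewrite[i] != '\\') continue;
    if (++i == rewrite.size()) {
      *error = "trailing backslash in replacement string";
      return false;
    }
    const char c = rewrite[i];
    if (c == '\\') continue;  // escaped backslash
    if (!std::isdigit(static_cast<unsigned char>(c))) {
      *error = "backslash must be followed by a digit or a backslash";
      return false;
    }
    if (c - '0' > num_groups) {
      *error = "back-reference to a group the pattern does not define";
      return false;
    }
  }
  return true;
}
```

Running such a check in the filter's constructor would let an invalid replacement string fail at configuration time instead of being silently ignored on every replacement.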

anomaly: avoid LRD to become inf when k points with same fv are added

Offline discussion with @hido:
Improve LOF to reject new data when k-1 existing points are at distance 0 from the data being added.
This helps prevent LRD from becoming inf (so that the LOF scores of its neighbors won't go to inf).

I'm planning to add a new config entry named ignore_kth_same_point (bool) to switch the behavior.
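
The rejection rule could be sketched as follows. This is only an illustration of the condition described above; should_reject_same_point and its input (a list of distances from the new point to each stored point) are hypothetical and do not reflect the actual LOF code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns true when the new point should be rejected: at least k-1 stored
// points lie at distance 0 from it. `distances` holds the distance from the
// new point to each stored point (hypothetical input shape).
bool should_reject_same_point(const std::vector<double>& distances,
                              std::size_t k, bool ignore_kth_same_point) {
  // With the switch off, or with k < 2, nothing is rejected.
  if (!ignore_kth_same_point || k < 2) return false;
  std::size_t same = 0;
  for (std::size_t i = 0; i < distances.size(); ++i) {
    if (distances[i] == 0.0) ++same;
  }
  return same >= k - 1;
}
```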

column table does not touch unlearner on MIX

The cluster has 2 servers and is configured with the unlearner enabled (max_size = 3).
Now consider the following scenario:

  1. Train Server 1 with 3 rows: [user1, user2, user3]
  2. Train Server 2 with 3 rows: [user4, user5, user6]
  3. do_mix

Then both servers have 6 rows, exceeding max_size = 3: [user1, user2, user3, user4, user5, user6]

mixable_versioned_table should touch the unlearner when adding new rows.

jubaclustering: MIX does not work correctly

I tested jubaclustering in distributed mode.

My test is the following scenario:

  • push to jubaclustering_1
  • Wait for MIX
  • get_core_members to jubaclustering_2

Expected: get_core_members returns the data pushed to jubaclustering_1.
Actual: get_core_members returns an empty list.

log of jubaclassifier_1

The version was not incremented.

2014-10-08 11:06:50,840 29431 INFO  [linear_mixer.cpp:612] put_diff with 1285 bytes finished my model is still up to date. versions [0th]
2014-10-08 11:06:50,841 29432 INFO  [linear_mixer.cpp:513] success to put_diff to [xxx.xxx.x.x:9000, xxx.xxx.x.x:9001]
2014-10-08 11:06:50,841 29432 INFO  [linear_mixer.cpp:525] mixed with 2 servers in 0.224686 secs, 1285 bytes (serialized data) has been put.
2014-10-08 11:06:50,841 29432 INFO  [linear_mixer.cpp:382] .... mix done. versions[0th]
2014-10-08 11:13:41,406 29432 INFO  [linear_mixer.cpp:373] got ZooKeeper lock, starting mix
2014-10-08 11:13:41,583 29432 INFO  [linear_mixer.cpp:477] success to get_diff from [xxx.xxx.x.x:9000, xxx.xxx.x.x:9001]
2014-10-08 11:13:41,768 29431 INFO  [linear_mixer.cpp:612] put_diff with 1486 bytes finished my model is still up to date. versions [0th]
2014-10-08 11:13:41,768 29432 INFO  [linear_mixer.cpp:513] success to put_diff to [xxx.xxx.x.x:9000, xxx.xxx.x.x:9001]
2014-10-08 11:13:41,768 29432 INFO  [linear_mixer.cpp:525] mixed with 2 servers in 0.362123 secs, 1486 bytes (serialized data) has been put.
2014-10-08 11:13:41,768 29432 INFO  [linear_mixer.cpp:382] .... mix done. versions[0th]

config

{
  "converter" : {
    "string_filter_types" : {},
    "string_filter_rules" : [],
    "num_filter_types" : {},
    "num_filter_rules" : [],
    "string_types" : {},
    "string_rules" : [
      { "key" : "*", "type" : "str", "sample_weight" : "bin", "global_weight" : "bin" }
    ],
    "num_types" : {},
    "num_rules" : [
      { "key" : "*", "type" : "num" }
    ]
  },
  "parameter" : {
    "k" : 3,
    "compressor_method" : "compressive_gmm",
    "bucket_size" : 10,
    "compressed_bucket_size" : 5,
    "bicriteria_base_size" : 1,
    "bucket_length" : 2,
    "forgetting_factor" : 0,
    "forgetting_threshold" : 0.5
  },
  "method" : "gmm"
}

Inverted index can be made faster

The current implementation of inverted index is slow. I think it has three problems.

  • unordered_map of unordered_map is too large. Use a more lightweight data structure, such as an ordered vector of key-value pairs (assoc vector).
  • Don't sort all scores. Use partial sort or heap to reduce calculation time.
  • Using unordered_map to store document scores is slow. Use an array.
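
The second and third suggestions can be sketched together. The following assumes documents are numbered densely from 0 so that scores fit in a plain array; top_k and scored_doc are hypothetical names, not part of the current implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

typedef std::pair<float, std::size_t> scored_doc;  // (score, document index)

// Returns the top `k` documents by score. `scores` is a dense array indexed by
// document number, standing in for a per-document score map.
std::vector<scored_doc> top_k(const std::vector<float>& scores, std::size_t k) {
  std::vector<scored_doc> ranked;
  ranked.reserve(scores.size());
  for (std::size_t i = 0; i < scores.size(); ++i) {
    ranked.push_back(std::make_pair(scores[i], i));
  }
  k = std::min(k, ranked.size());
  // Only the first k positions are put in order; the rest stay unsorted.
  std::partial_sort(ranked.begin(), ranked.begin() + k, ranked.end(),
                    std::greater<scored_doc>());
  ranked.resize(k);
  return ranked;
}
```

With n documents and a small k, std::partial_sort does roughly n log k comparisons instead of the n log n needed to sort every score.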

Unfriendly classifier message for invalid config

When I specified:

{
  "converter": { … snip …},
  "method": "AROW"
}

... then I got the message: Object is expected, but Null is given.

The correct config was as follows:

{
  "converter": { … snip …},
  "method": "AROW",
  "parameter": {
    "regularization_weight": 1.0
  }
}

(parameter block was missing in the first example)

It is very hard to guess the cause from this message.

I think it's better to take care of these cases around here:
https://github.com/jubatus/jubatus_core/blob/master/jubatus/core/classifier/classifier_factory.cpp#L77
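
One possible shape for a friendlier check is sketched below. It does not use the jsonconfig API; config_view (a flat map from top-level key to raw text) and require_parameter are hypothetical, and the message wording is only a suggestion.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical flat view of a config: top-level key -> raw JSON text.
typedef std::map<std::string, std::string> config_view;

// Throws an error that names the missing key and the method that needs it,
// instead of a generic type-mismatch message.
void require_parameter(const config_view& config) {
  config_view::const_iterator method = config.find("method");
  if (method == config.end()) {
    throw std::invalid_argument("config: required key \"method\" is missing");
  }
  if (config.find("parameter") == config.end()) {
    throw std::invalid_argument("config: method \"" + method->second +
                                "\" requires a \"parameter\" object, "
                                "but none was given");
  }
}
```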

jsonconfig generates many compiler warnings

When running ./waf --checkall, I see many ‘value’ may be used uninitialized in this function [-Wuninitialized] warnings near jubatus/core/common/jsonconfig/cast.hpp. Why?

In file included from ../jubatus/core/unlearner/../common/jsonconfig.hpp:22:0,
                 from ../jubatus/core/unlearner/unlearner_factory.hpp:21,
                 from ../jubatus/core/unlearner/unlearner_factory.cpp:17:
/home/jubatus/Development/jubatus_core/jubatus/util/text/json/../../data/optional.h: In function ‘void jubatus::core::common::jsonconfig::serialize(jubatus::core::common::jsonconfig::json_config_iarchive_cast&, jubatus::util::data::serialization::named_value<jubatus::util::data::optional<T> >&) [with T = long int]’:
/home/jubatus/Development/jubatus_core/jubatus/util/text/json/../../data/optional.h:83:7: warning: ‘value’ may be used uninitialized in this function [-Wuninitialized]
../jubatus/core/unlearner/../common/./jsonconfig/cast.hpp:247:7: note: ‘value’ was declared here

crc32 isn't used by jubatus_core itself

I found that jubatus_core doesn't use crc32 and I don't see any necessity of providing crc32 module as part of jubatus_core. crc32 module is used in the server when it saves models, so I think the server should have it.

update README

The README should be written for jubatus_core (it is currently the same as the one in the jubatus repository).

string_filter_rules: regexp (oniguruma) does not support replacing with capture group

When compiled with oniguruma, regexp replace in string_filter_types does not support replacement strings with capture group back-references. For example, the following configuration:

"string_filter_types" : {
  "juicer":  {"method": "regexp", "pattern": "(apple|banana|orange)", "replace": "\\1 juice"}
},

does not work as expected. This works with re2.

We're currently just replacing the matched string with the replacement string. We should handle these back-references ourselves (oniguruma does not provide this feature).

https://github.com/jubatus/jubatus_core/blob/0.0.3/jubatus/core/fv_converter/onig_filter.cpp#L56
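
Handling back-references ourselves might look like the sketch below. It assumes the capture groups of a match have already been extracted into a vector of strings (index 0 for the whole match); expand_backrefs is a hypothetical helper and does not call oniguruma.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Expands \0..\9 in `replace` using `groups`, where groups[0] is the whole
// match and groups[n] is the n-th capture. Unknown groups expand to "".
std::string expand_backrefs(const std::string& replace,
                            const std::vector<std::string>& groups) {
  std::string out;
  for (std::string::size_type i = 0; i < replace.size(); ++i) {
    const char c = replace[i];
    if (c == '\\' && i + 1 < replace.size()) {
      const char next = replace[i + 1];
      if (std::isdigit(static_cast<unsigned char>(next))) {
        const std::string::size_type n = next - '0';
        if (n < groups.size()) out += groups[n];
        ++i;
        continue;
      }
      if (next == '\\') {  // "\\" is a literal backslash
        out += '\\';
        ++i;
        continue;
      }
    }
    out += c;
  }
  return out;
}
```

For the "juicer" rule above, a match on "apple" would give groups {"apple", "apple"}, and the replacement string \1 juice would expand to "apple juice".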

GMM Test fails when debug is enabled

# ./waf configure --enable-debug
(snip)
# ./waf --check
(snip)
  tests that fail 1/137 
    /home/subaru/jubatus/jubatus_core/build/jubatus/core/clustering/clustering_gmm_test 
Running main() from gtest_main.cc
[==========] Running 3 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 3 tests from gmm_test
[ RUN      ] gmm_test.centers_and_covs
[       OK ] gmm_test.centers_and_covs (187 ms)
[ RUN      ] gmm_test.nearest_center
[       OK ] gmm_test.nearest_center (19 ms)
[ RUN      ] gmm_test.clear_with_empty_list

/usr/include/c++/4.8/debug/vector:353:error: attempt to subscript container 
    with out-of-bounds index 0, but container only holds 0 elements.

Objects involved in the operation:
sequence "this" @ 0x0x249c840 {
  type = NSt7__debug6vectorIN5Eigen12SparseVectorIdLi0EiEESaIS3_EEE;
}

test failed

Anomaly: LRU unlearner does not delete id in order of registration

Consider the following scenario in jubaanomaly (light_lof) with lru_unlearner (max_size = 5):

  1. add => Id 0 was added
  2. add => Id 1 was added
  3. add => Id 2 was added
  4. add => Id 3 was added
  5. add => Id 4 was added
  6. add => Id 5 was added
  7. get_all_rows

Expected: get_all_rows returns ['1', '2', '3', '4', '5']
Actual: get_all_rows returns ['0', '1', '3', '4', '5'] ... Id 2 was deleted

light_lof currently touches the ids that are reverse k-nearest neighbors.
https://github.com/jubatus/jubatus_core/blob/0.0.7/jubatus/core/anomaly/light_lof.cpp#L178-L181
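
To see how touching neighbors can produce this result, here is a minimal LRU sketch. The class lru_sketch is hypothetical, and the neighbor pattern in replay_scenario (adding Id 3 touches Id 0, adding Id 4 touches Id 1) is an assumption chosen for illustration, not something taken from the actual run.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <list>
#include <string>

// Minimal LRU sketch: front is least recently used, back is most recent.
class lru_sketch {
 public:
  explicit lru_sketch(std::size_t max_size) : max_size_(max_size) {}

  // Marks `id` as most recently used, evicting the oldest id when full.
  void touch(const std::string& id) {
    std::list<std::string>::iterator it =
        std::find(order_.begin(), order_.end(), id);
    if (it != order_.end()) {
      order_.erase(it);
    } else if (!order_.empty() && order_.size() >= max_size_) {
      order_.pop_front();
    }
    order_.push_back(id);
  }

  bool contains(const std::string& id) const {
    return std::find(order_.begin(), order_.end(), id) != order_.end();
  }

 private:
  std::size_t max_size_;
  std::list<std::string> order_;
};

// Replays the scenario with an assumed neighbor pattern: adding "3" touches
// "0" and adding "4" touches "1".
lru_sketch replay_scenario() {
  lru_sketch lru(5);
  lru.touch("0");
  lru.touch("1");
  lru.touch("2");
  lru.touch("3"); lru.touch("0");
  lru.touch("4"); lru.touch("1");
  lru.touch("5");  // evicts "2", the least recently touched id
  return lru;
}
```

Under that assumed pattern, Id 2 is the least recently touched entry when Id 5 arrives, so it is evicted while Id 0 and Id 1 survive.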

Add bias term by default

As reported on the mailing list and at the Casual Talk event, some users want this.
We should discuss whether we should support it.

Predefined normalization in fv_converter

Based on the discussion on jubatus/jubatus#255, we decided to implement a normalization method with predefined max/min to have [0, 1]-scaled feature value.
There can also be another option (possibly truncate="True"/"False") to choose whether or not out-of-range values are truncated into exactly [0, 1].
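
The proposed scaling can be sketched as follows. The function name normalize and its signature are hypothetical; only the predefined min/max and the truncate option come from the discussion above.

```cpp
#include <algorithm>
#include <cassert>
#include <stdexcept>

// Scales `x` into [0, 1] using a predefined range [min, max]. When `truncate`
// is true, out-of-range results are clamped to exactly [0, 1].
double normalize(double x, double min, double max, bool truncate) {
  if (!(min < max)) {
    throw std::invalid_argument("normalize: min must be less than max");
  }
  const double scaled = (x - min) / (max - min);
  if (!truncate) return scaled;
  return std::max(0.0, std::min(1.0, scaled));
}
```

With min = 0 and max = 10, an input of 15 maps to 1.5 without truncation and to 1.0 with it.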

random_unlearner should take care of entries deleted by user

Think of the following scenario in classifier:

  • Set max_size to 2.
  • Call train: label_1
  • Call train: label_2
  • Call delete_label: label_2
  • Call train: label_3
  • Call get_labels

Expected: get_labels returns [label_1, label_3]
Actual: get_labels returns [label_3] only.

random_unlearner currently does not remove an entry that the user removed via delete_label (or remove_row in recommender, etc.).
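
A sketch of the intended behavior follows. The class random_unlearner_sketch and its remove method are hypothetical and do not mirror the real unlearner interface; the point is only that a user-side deletion must also be forwarded to the unlearner so the entry stops counting toward max_size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

class random_unlearner_sketch {
 public:
  explicit random_unlearner_sketch(std::size_t max_size)
      : max_size_(max_size) {}

  // Registers `id`; when full, evicts a random id and returns it ("" if none).
  std::string touch(const std::string& id) {
    if (std::find(ids_.begin(), ids_.end(), id) != ids_.end()) return "";
    std::string evicted;
    if (!ids_.empty() && ids_.size() >= max_size_) {
      const std::size_t victim = std::rand() % ids_.size();
      evicted = ids_[victim];
      ids_[victim] = ids_.back();
      ids_.pop_back();
    }
    ids_.push_back(id);
    return evicted;
  }

  // Forgets an id the user deleted, so it no longer counts toward max_size.
  void remove(const std::string& id) {
    ids_.erase(std::remove(ids_.begin(), ids_.end(), id), ids_.end());
  }

  std::size_t size() const { return ids_.size(); }

 private:
  std::size_t max_size_;
  std::vector<std::string> ids_;
};
```

In the scenario above, calling remove("label_2") when delete_label runs frees a slot, so training label_3 evicts nothing and label_1 survives.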

Drivers should be thread safe.

To handle concurrent access during machine learning, the Jubatus server holds a giant lock for each model.
But a giant lock hurts performance on multi-core processors.

Drivers should be thread safe, and the server should not hold a giant lock for each model.

Support unlearning for `inverted_index`

inverted_index is stable and easy to use, but it does not support unlearning. Users who need unlearning cannot choose inverted_index.

I have two ideas to achieve it:

  • implement inverted_index nearest neighbor module, and implement unlearning
  • implement unlearning in inverted_index recommender

The former is preferable.

remove `BURST_DEBUG`

We shouldn't use this compile option.
Instead, we should use the NDEBUG flag for debugging.

nearest_neighbor does not call can_touch before touch

Note: As nearest_neighbor itself does not support the unlearner so far, this issue does not affect the functionality of current Jubatus. Only people who plan to use the driver layer of the nearest_neighbor engine directly with the unlearner enabled will be affected.

In nearest_neighbor, touch is called without calling can_touch.

https://github.com/jubatus/jubatus_core/blob/0.1.0/jubatus/core/driver/nearest_neighbor.cpp#L61

As the touch operation may fail (in which case the model must not be updated), we should check whether the unlearner accepts the given ID before actually updating the model.
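
The check-then-update order could be sketched like this. Only the names can_touch and touch come from the issue; bounded_unlearner_sketch, set_row_checked, and the use of a std::set as the model are hypothetical simplifications.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical unlearner that refuses new ids once it is full.
struct bounded_unlearner_sketch {
  std::size_t max_size;
  std::set<std::string> ids;

  bool can_touch(const std::string& id) const {
    return ids.count(id) > 0 || ids.size() < max_size;
  }
  void touch(const std::string& id) { ids.insert(id); }
};

// Updates `model` only when the unlearner accepts `id`; returns false and
// leaves both untouched otherwise.
bool set_row_checked(bounded_unlearner_sketch& unlearner,
                     std::set<std::string>& model, const std::string& id) {
  if (!unlearner.can_touch(id)) return false;
  unlearner.touch(id);
  model.insert(id);
  return true;
}
```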
