jubatus / jubatus_core

Jubatus algorithm component

License: GNU Lesser General Public License v2.1

Makefile 0.01% C++ 90.61% Python 8.86% C 0.19% Ruby 0.21% Shell 0.12%

jubatus_core's Introduction

jubatus_core

jubatus_core is the core library of Jubatus.

See http://jubat.us/ for details of Jubatus.

How to install

We officially support Red Hat Enterprise Linux (RHEL) 6.2 or later (64-bit) and Ubuntu Server 14.04 LTS / 16.04 LTS (64-bit). On supported systems, you can install all components of Jubatus using binary packages.

If you have already installed Jubatus 0.6.0 or later, you can already use jubatus_core. QuickStart describes how to install Jubatus.

If you do not want to install the whole of Jubatus, you can install jubatus_core only. Before installation, install msgpack and oniguruma (oniguruma is optional). Then run the following:

wget -O jubatus_core-master.tar.gz https://github.com/jubatus/jubatus_core/archive/master.tar.gz
tar xf jubatus_core-master.tar.gz
cd jubatus_core-master
./waf configure --prefix=<prefix>
./waf
./waf --checkall
./waf install

If you do not need oniguruma, type

./waf configure --regexp-library=none --prefix=<prefix>

instead of

./waf configure --prefix=<prefix>

If you want to use re2 instead of oniguruma, add --regexp-library=re2 to ./waf configure.

License

LGPL 2.1

Third-party library included in jubatus_core

The Jubatus source tree includes the following third-party libraries.

Eigen (mainly under the MPL2 license, while some code is under LGPL2.1 or LGPL2.1+)

A fork of pficommon (placed under jubatus_core's jubatus/core/util/. New BSD License)

Update history

The update history can be found in ChangeLog.

Contributors

Contributors are listed at https://github.com/jubatus/jubatus/contributors and https://github.com/jubatus/jubatus_core/contributors.

jubatus_core's Issues

Invalid replace string is not checked in re2_filter

We need to validate replacement strings in re2_filter. When a user gives an invalid replacement string to re2_filter, it reports no error.

In the GlobalReplace method in re2 (https://code.google.com/p/re2/source/browse/re2/re2.cc#374), errors raised in the Rewrite method are ignored at https://code.google.com/p/re2/source/browse/re2/re2.cc#399. To check the validity of the replacement string, we need to call CheckRewriteString (https://code.google.com/p/re2/source/browse/re2/re2.cc#907) in the constructor.
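The kind of check described above can be sketched as follows. This is only an illustration in Python using the standard re module, not RE2's C++ API; the function name check_rewrite_string is hypothetical and simply mirrors the idea of rejecting a bad replacement string up front, at construction time.

```python
import re


def check_rewrite_string(pattern, rewrite):
    """Raise ValueError if `rewrite` is not usable with `pattern`.

    Hypothetical helper: only numeric back-references (a backslash
    followed by one digit) and escaped backslashes are accepted.
    """
    group_count = re.compile(pattern).groups
    i = 0
    while i < len(rewrite):
        if rewrite[i] != "\\":
            i += 1
            continue
        if i + 1 >= len(rewrite):
            raise ValueError("rewrite string ends with a lone backslash")
        nxt = rewrite[i + 1]
        if nxt in "0123456789":
            # Group 0 is the whole match; higher numbers need a capture group.
            if int(nxt) > group_count:
                raise ValueError(
                    "rewrite refers to group %s but pattern has only %d group(s)"
                    % (nxt, group_count)
                )
        elif nxt != "\\":
            raise ValueError("invalid escape in rewrite string: \\" + nxt)
        i += 2


# Valid: the pattern has one capture group, and the rewrite uses group 1.
check_rewrite_string("(apple|banana)", "\\1 juice")
```

Calling the same check with "\\2 juice" raises ValueError, which is the error the filter currently fails to report.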

anomaly: avoid LRD to become inf when k points with same fv are added

Offline discussion with @hido:
Improve LOF to reject new data when there are already k-1 points whose distance to the data being added is 0.
This prevents LRD from becoming inf (so that the LOF scores of its neighbors won't become inf).

I'm planning to add a new config entry named ignore_kth_same_point (bool) to switch this behavior.
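To see why coincident points are a problem: LRD is the inverse of the mean reachability distance to the k nearest neighbors, so when all of those distances are 0 the density is unbounded. The sketch below illustrates the proposed rejection rule on plain tuples with Euclidean distance. The function names are hypothetical and this is not the jubatus_core implementation.

```python
import math


def local_reachability_density(reach_dists):
    """LRD = 1 / mean reachability distance; inf when every distance is 0."""
    total = sum(reach_dists)
    return math.inf if total == 0.0 else len(reach_dists) / total


def should_reject(points, new_point, k, ignore_kth_same_point=True):
    """Return True when k-1 stored points already coincide with new_point."""
    if not ignore_kth_same_point:
        return False
    identical = sum(1 for p in points if math.dist(p, new_point) == 0.0)
    return identical >= k - 1


stored = [(0.0, 0.0), (0.0, 0.0), (1.0, 1.0)]
print(should_reject(stored, (0.0, 0.0), k=3))   # two identical points already stored
print(local_reachability_density([0.0, 0.0, 0.0]))
```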

core: Remove unused tools

jubadoc and packaging are not used by jubatus_core.

column table does not touch unlearner on MIX

The cluster has 2 servers and is configured with unlearner enabled (max_size = 3).
Now think of the following scenario:

Train Server 1 with 3 rows: [user1, user2, user3]
Train Server 2 with 3 rows: [user4, user5, user6]
do_mix

Then both servers have 6 rows, which exceeds max_size: [user1, user2, user3, user4, user5, user6]

mixable_versioned_table should touch the unlearner when adding new rows.
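A toy model of the intended behavior, assuming an LRU-style unlearner: rows that arrive through MIX go through the same path as locally added rows, so the unlearner sees them and the table stays within max_size. The class and method names here (LruUnlearner, Table, put_diff) are illustrative only and do not reflect the actual mixable_versioned_table interface.

```python
from collections import OrderedDict


class LruUnlearner:
    """Toy LRU unlearner: tracks ids and reports which id to evict."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._ids = OrderedDict()

    def touch(self, row_id):
        """Mark row_id as most recently used; return the evicted id or None."""
        if row_id in self._ids:
            self._ids.move_to_end(row_id)
            return None
        evicted = None
        if len(self._ids) >= self.max_size:
            evicted, _ = self._ids.popitem(last=False)
        self._ids[row_id] = True
        return evicted


class Table:
    def __init__(self, max_size):
        self.rows = {}
        self.unlearner = LruUnlearner(max_size)

    def add_row(self, row_id, value):
        evicted = self.unlearner.touch(row_id)
        if evicted is not None:
            self.rows.pop(evicted, None)
        self.rows[row_id] = value

    def put_diff(self, diff):
        """Apply rows received through MIX via add_row, so the unlearner sees them."""
        for row_id, value in diff.items():
            self.add_row(row_id, value)


server1 = Table(max_size=3)
for name in ("user1", "user2", "user3"):
    server1.add_row(name, {})
server1.put_diff({"user4": {}, "user5": {}, "user6": {}})
print(sorted(server1.rows))  # never more than max_size rows
```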

jubaclustering: MIX does not work correctly

I tested jubaclustering in distributed-mode.

My test is the following scenario:

push to jubaclustering_1
Wait for MIX
get_core_members to jubaclustering_2

Expected: get_core_members returns the data pushed to jubaclustering_1.
Actual: get_core_members returns empty list.

log of jubaclustering_1

version was not incremented.

2014-10-08 11:06:50,840 29431 INFO  [linear_mixer.cpp:612] put_diff with 1285 bytes finished my model is still up to date. versions [0th]
2014-10-08 11:06:50,841 29432 INFO  [linear_mixer.cpp:513] success to put_diff to [xxx.xxx.x.x:9000, xxx.xxx.x.x:9001]
2014-10-08 11:06:50,841 29432 INFO  [linear_mixer.cpp:525] mixed with 2 servers in 0.224686 secs, 1285 bytes (serialized data) has been put.
2014-10-08 11:06:50,841 29432 INFO  [linear_mixer.cpp:382] .... mix done. versions[0th]
2014-10-08 11:13:41,406 29432 INFO  [linear_mixer.cpp:373] got ZooKeeper lock, starting mix
2014-10-08 11:13:41,583 29432 INFO  [linear_mixer.cpp:477] success to get_diff from [xxx.xxx.x.x:9000, xxx.xxx.x.x:9001]
2014-10-08 11:13:41,768 29431 INFO  [linear_mixer.cpp:612] put_diff with 1486 bytes finished my model is still up to date. versions [0th]
2014-10-08 11:13:41,768 29432 INFO  [linear_mixer.cpp:513] success to put_diff to [xxx.xxx.x.x:9000, xxx.xxx.x.x:9001]
2014-10-08 11:13:41,768 29432 INFO  [linear_mixer.cpp:525] mixed with 2 servers in 0.362123 secs, 1486 bytes (serialized data) has been put.
2014-10-08 11:13:41,768 29432 INFO  [linear_mixer.cpp:382] .... mix done. versions[0th]

config

{
  "converter" : {
    "string_filter_types" : {},
    "string_filter_rules" : [],
    "num_filter_types" : {},
    "num_filter_rules" : [],
    "string_types" : {},
    "string_rules" : [
      { "key" : "*", "type" : "str", "sample_weight" : "bin", "global_weight" : "bin" }
    ],
    "num_types" : {},
    "num_rules" : [
      { "key" : "*", "type" : "num" }
    ]
  },
  "parameter" : {
    "k" : 3,
    "compressor_method" : "compressive_gmm",
    "bucket_size" : 10,
    "compressed_bucket_size" : 5,
    "bicriteria_base_size" : 1,
    "bucket_length" : 2,
    "forgetting_factor" : 0,
    "forgetting_threshold" : 0.5
  },
  "method" : "gmm"
}

combination feature crash when sfv_t is empty

convert_combinations does not handle the case where the given sfv_t is empty, which causes the process to hang.

https://github.com/jubatus/jubatus_core/blob/0.1.1/jubatus/core/fv_converter/datum_to_fv_converter.cpp#L552

This case can be reproduced by sending an empty Datum to any API.

Inverted index can be faster

The current implementation of the inverted index is slow. I think it has three problems:

The unordered_map of unordered_map is too large. Use a more lightweight data structure, such as a sorted vector of key-value pairs (assoc vector).
All scores are sorted. Use a partial sort or a heap to reduce calculation time.
Document scores are stored in an unordered_map, which is slow. Use an array.
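The three suggestions above can be sketched together. The following is a Python illustration under simplifying assumptions (documents identified by dense integer indexes, postings stored as lists of pairs); the function name top_k_similar is hypothetical and this is not the jubatus_core data layout.

```python
import heapq


def top_k_similar(postings, query, num_docs, k):
    """Return the k best (doc_index, score) pairs, best first.

    postings: term -> list of (doc_index, weight) pairs (a flat, sorted list
              instead of a nested hash map)
    query:    term -> weight
    """
    # A flat array indexed by document replaces a hash map of scores.
    scores = [0.0] * num_docs
    for term, q_weight in query.items():
        for doc_index, d_weight in postings.get(term, ()):
            scores[doc_index] += q_weight * d_weight

    # A heap-based partial selection avoids sorting every score.
    candidates = ((score, doc) for doc, score in enumerate(scores) if score > 0.0)
    best = heapq.nlargest(k, candidates)
    return [(doc, score) for score, doc in best]


postings = {"a": [(0, 1.0), (2, 2.0)], "b": [(1, 1.0), (2, 1.0)]}
print(top_k_similar(postings, {"a": 1.0, "b": 1.0}, num_docs=3, k=1))
```

heapq.nlargest keeps only k candidates in memory, which is the same idea as a partial sort in C++.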

Nearest Neighbor: `get_all_rows` and `clear_row` feature

Recommender provides get_all_rows and clear_row whereas Nearest Neighbor does not.
I want them in NN.

Unfriendly classifier message for invalid config

When I specified:

{
  "converter": { … snip …},
  "method": "AROW"
}

... then I got the message: Object is expected, but Null is given.

The correct config was as follows:

{
  "converter": { … snip …},
  "method": "AROW",
  "parameter": {
    "regularization_weight": 1.0
  }
}

(parameter block was missing in the first example)

It is very hard to guess the cause from this message.

I think it's better to take care of these cases around here:
https://github.com/jubatus/jubatus_core/blob/master/jubatus/core/classifier/classifier_factory.cpp#L77
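As an illustration of the kind of check that would make the error clearer, here is a small Python sketch that validates the config before it is used. The function name and the list of methods that require a parameter block are assumptions made for this example; only AROW is confirmed by the report above.

```python
def validate_classifier_config(config, methods_requiring_parameter=("AROW",)):
    """Raise ValueError with a readable message when required keys are missing."""
    for key in ("converter", "method"):
        if key not in config:
            raise ValueError('missing required key "%s" in classifier config' % key)
    method = config["method"]
    parameter = config.get("parameter")
    if method in methods_requiring_parameter and not isinstance(parameter, dict):
        raise ValueError(
            '"parameter" must be an object for method "%s" '
            '(for example {"regularization_weight": 1.0})' % method
        )


try:
    validate_classifier_config({"converter": {}, "method": "AROW"})
except ValueError as error:
    print(error)
```

A message that names the missing key tells the user what to add, unlike "Object is expected, but Null is given."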

bandit: exp3 gamma parameter cannot be zero

Refs jubatus/website#209

As the original paper says, gamma_ parameter cannot be zero. Otherwise rewards will always become zero.

https://github.com/jubatus/jubatus_core/blob/0.2.0/jubatus/core/bandit/exp3.cpp#L32-L35
https://github.com/jubatus/jubatus_core/blob/0.2.0/jubatus/core/bandit/exp3.cpp#L91-L92

jsonconfig generates many compiler warnings

When running ./waf --checkall, I see many ‘value’ may be used uninitialized in this function [-Wuninitialized] warnings near jubatus/core/common/jsonconfig/cast.hpp. Why?

In file included from ../jubatus/core/unlearner/../common/jsonconfig.hpp:22:0,
                 from ../jubatus/core/unlearner/unlearner_factory.hpp:21,
                 from ../jubatus/core/unlearner/unlearner_factory.cpp:17:
/home/jubatus/Development/jubatus_core/jubatus/util/text/json/../../data/optional.h: In function ‘void jubatus::core::common::jsonconfig::serialize(jubatus::core::common::jsonconfig::json_config_iarchive_cast&, jubatus::util::data::serialization::named_value<jubatus::util::data::optional<T> >&) [with T = long int]’:
/home/jubatus/Development/jubatus_core/jubatus/util/text/json/../../data/optional.h:83:7: warning: ‘value’ may be used uninitialized in this function [-Wuninitialized]
../jubatus/core/unlearner/../common/./jsonconfig/cast.hpp:247:7: note: ‘value’ was declared here

weight_manager problems in recommender, NN, anomaly and clustering

In recommender and anomaly, weight_manager is not saved to the model file.
In recommender, NN, anomaly and clustering, weight_manager is not MIXed between servers.

crc32 isn't used by jubatus_core itself

I found that jubatus_core doesn't use crc32 and I don't see any necessity of providing crc32 module as part of jubatus_core. crc32 module is used in the server when it saves models, so I think the server should have it.

update README

The README should be written for jubatus_core. (Currently it is the same as the one in the jubatus repository.)

core/column/table should move into core/storage

table is a kind of storage!

string_filter_rules: regexp (oniguruma) does not support replacing with capture group

When compiled with oniguruma, regexp replace in string_filter_types does not support replacement strings with capture group back-reference. For example, the following configuration:

"string_filter_types" : {
  "juicer":  {"method": "regexp", "pattern": "(apple|banana|orange)", "replace": "\\1 juice"}
},

does not work as expected. This works with re2.

We're currently just replacing the matched string with the replacement string. We should handle these back-references ourselves (oniguruma does not provide this feature).

https://github.com/jubatus/jubatus_core/blob/0.0.3/jubatus/core/fv_converter/onig_filter.cpp#L56
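Handling back-references by hand could look roughly like this. The sketch is in Python, with the standard re module standing in for the regex engine; the helper names are hypothetical. The replacement string is expanded manually from the captured groups of each match instead of relying on the engine.

```python
import re


def expand_backrefs(replacement, groups):
    """Expand a backslash followed by a digit using groups; groups[0] is the whole match."""
    out = []
    i = 0
    while i < len(replacement):
        ch = replacement[i]
        if ch == "\\" and i + 1 < len(replacement):
            nxt = replacement[i + 1]
            if nxt in "0123456789" and int(nxt) < len(groups):
                # Unmatched optional groups are None; treat them as empty.
                out.append(groups[int(nxt)] or "")
                i += 2
                continue
            if nxt == "\\":
                out.append("\\")
                i += 2
                continue
        out.append(ch)
        i += 1
    return "".join(out)


def regexp_replace(pattern, replacement, text):
    """Replace every match, expanding back-references manually per match."""
    def substitute(match):
        groups = (match.group(0),) + match.groups()
        return expand_backrefs(replacement, groups)

    return re.sub(pattern, substitute, text)


# "\\1 juice" in the JSON config decodes to a backslash, the digit 1, and " juice".
print(regexp_replace("(apple|banana|orange)", "\\1 juice", "apple and orange"))
```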

GMM Test fails when debug is enabled

# ./waf configure --enable-debug
(snip)
# ./waf --check
(snip)
  tests that fail 1/137 
    /home/subaru/jubatus/jubatus_core/build/jubatus/core/clustering/clustering_gmm_test 
Running main() from gtest_main.cc
[==========] Running 3 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 3 tests from gmm_test
[ RUN      ] gmm_test.centers_and_covs
[       OK ] gmm_test.centers_and_covs (187 ms)
[ RUN      ] gmm_test.nearest_center
[       OK ] gmm_test.nearest_center (19 ms)
[ RUN      ] gmm_test.clear_with_empty_list

/usr/include/c++/4.8/debug/vector:353:error: attempt to subscript container 
    with out-of-bounds index 0, but container only holds 0 elements.

Objects involved in the operation:
sequence "this" @ 0x0x249c840 {
  type = NSt7__debug6vectorIN5Eigen12SparseVectorIdLi0EiEESaIS3_EEE;
}

test failed

Anomaly: LRU unlearner does not delete id in order of registration

Consider the following scenario in jubaanomaly (light_lof) with lru_unlearner (max_size = 5):

add => Id 0 was added
add => Id 1 was added
add => Id 2 was added
add => Id 3 was added
add => Id 4 was added
add => Id 5 was added
get_all_rows

Expected: get_all_rows returns ['1', '2', '3', '4', '5']
Actual: get_all_rows returns ['0', '1', '3', '4', '5'] ... Id 2 was deleted

light_lof currently touches the ids that are reverse k-nearest neighbors.
https://github.com/jubatus/jubatus_core/blob/0.0.7/jubatus/core/anomaly/light_lof.cpp#L178-L181
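The effect can be reproduced with a toy LRU. In the sketch below, the neighbor relation passed to add_all is invented purely for illustration (it is not taken from light_lof): touching neighbors after an add refreshes their position, so an id added later can be evicted first.

```python
from collections import OrderedDict


class Lru:
    def __init__(self, max_size):
        self.max_size = max_size
        self.ids = OrderedDict()

    def touch(self, row_id):
        """Insert or refresh row_id, evicting the least recently used id if full."""
        if row_id in self.ids:
            self.ids.move_to_end(row_id)
            return
        if len(self.ids) >= self.max_size:
            self.ids.popitem(last=False)
        self.ids[row_id] = True


def add_all(max_size, new_ids, neighbors_of=None):
    """Add ids in order; optionally also touch each id's (hypothetical) neighbors."""
    lru = Lru(max_size)
    for row_id in new_ids:
        lru.touch(row_id)
        for neighbor in (neighbors_of or {}).get(row_id, ()):
            if neighbor in lru.ids:
                lru.touch(neighbor)
    return sorted(lru.ids)


ids = ["0", "1", "2", "3", "4", "5"]
print(add_all(5, ids))                            # registration order only
print(add_all(5, ids, {"4": ["0", "1"]}))         # neighbors of "4" get refreshed
```

With registration order only, id 0 is evicted. With the invented neighbor touches, ids 0 and 1 are refreshed and id 2 becomes the oldest entry, matching the reported result.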

Add bias term by default

As reported in the mailing list and Casual Talk event, some users want this.
We should discuss whether or not to support it.

bandit: Several calculations in exp3 look wrong.

These two calculations below look wrong.

https://github.com/jubatus/jubatus_core/blob/master/jubatus/core/bandit/exp3.cpp#L56
https://github.com/jubatus/jubatus_core/blob/master/jubatus/core/bandit/exp3.cpp#L92
should be weights[i] = (1.0 - gamma_) * weights[i] / total_weight + gamma_ / n
should be reward / weights[i] * gamma_ / arms.size())
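For reference, the two formulas above match the textbook Exp3 algorithm, sketched below in Python. This is a standalone illustration of the standard algorithm, not the code in exp3.cpp; the function names are made up for this example.

```python
import math


def exp3_probabilities(weights, gamma):
    """Mix the normalized weights with a uniform distribution using gamma."""
    total = sum(weights)
    n = len(weights)
    return [(1.0 - gamma) * w / total + gamma / n for w in weights]


def exp3_update(weights, arm, reward, gamma):
    """Return new weights after observing `reward` (in [0, 1]) for `arm`."""
    n = len(weights)
    probability = exp3_probabilities(weights, gamma)[arm]
    # Importance-weighted reward: only the pulled arm gets a nonzero estimate.
    estimated_reward = reward / probability
    updated = list(weights)
    updated[arm] *= math.exp(gamma * estimated_reward / n)
    return updated


weights = [1.0, 1.0]
print(exp3_probabilities(weights, gamma=0.5))
print(exp3_update(weights, arm=0, reward=1.0, gamma=0.5))
```

Note that with gamma set to 0 the exponent is always 0, so the weights never change. This is consistent with the separate report that gamma must not be zero.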

Predefined normalization in fv_converter

Based on the discussion on jubatus/jubatus#255, we decided to implement a normalization method with predefined max/min to have [0, 1]-scaled feature value.
There can also be another option (possibly truncate="True"/"False") to choose whether or not out-of-range values are truncated into exactly [0, 1].
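A minimal sketch of the proposed normalization, written in Python for illustration. The parameter names (min_value, max_value, truncate) are placeholders; the final config keys were not decided in the discussion.

```python
def normalize(value, min_value, max_value, truncate=True):
    """Scale value into [0, 1] using predefined bounds.

    With truncate=True, out-of-range inputs are clipped to exactly [0, 1];
    with truncate=False they may fall outside that range.
    """
    if max_value <= min_value:
        raise ValueError("max_value must be greater than min_value")
    scaled = (value - min_value) / (max_value - min_value)
    if truncate:
        scaled = min(1.0, max(0.0, scaled))
    return scaled


print(normalize(5, 0, 10))
print(normalize(15, 0, 10))
print(normalize(15, 0, 10, truncate=False))
```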

clustering: meaningless call of get_all?

jubatus_core/jubatus/core/clustering/storage.cpp

Line 63 in 04692c6

wplist all = get_all();

The result from get_all is unused.

Consider removing the default constructor that initializes a shared_ptr as null.

This default constructor

jubatus_core/jubatus/core/framework/mixable_helper.hpp

Line 46 in 076e08a

linear_mixable_helper() {

initializes model_ as null. This is error-prone. An example: jubatus/jubatus#888

typo in column table argument

The variable name colum_id should be column_id.

https://github.com/jubatus/jubatus_core/blob/0.0.7/jubatus/core/table/column/column_table.hpp#L139-L148

Wrong exception message in stat::moment

s/min/moment/

https://github.com/jubatus/jubatus_core/blob/0.0.2/jubatus/core/stat/stat.cpp#L154

clustering does not implement clear

https://github.com/jubatus/jubatus_core/blob/master/jubatus/core/driver/clustering.cpp#L181

Add driver test for standalone clustering

This is a continued issue taken from jubatus/jubatus/issues/654.
The solution should be a modified version of jubatus/jubatus/pull/736.

random_unlearner should take care of entries deleted by user

Think of the following scenario in classifier:

Set max_size to 2.
Call train: label_1
Call train: label_2
Call delete_label: label_2
Call train: label_3
Call get_labels

Expected: get_labels returns [label_1, label_3]
Actual: get_labels returns [label_3] only.

random_unlearner currently does not remove entries that the user removed via delete_label (or remove_row in recommender, etc.).
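One way to keep the unlearner's bookkeeping in sync is to give it a remove operation that the model calls on explicit deletion. The following Python sketch shows the idea with toy classes; the names RandomUnlearner.remove and Labels are hypothetical and not the jubatus_core API.

```python
import random


class RandomUnlearner:
    def __init__(self, max_size, seed=None):
        self.max_size = max_size
        self._ids = []
        self._rng = random.Random(seed)

    def touch(self, entry_id):
        """Register entry_id; return the id evicted to make room, or None."""
        if entry_id in self._ids:
            return None
        evicted = None
        if len(self._ids) >= self.max_size:
            evicted = self._ids.pop(self._rng.randrange(len(self._ids)))
        self._ids.append(entry_id)
        return evicted

    def remove(self, entry_id):
        """Forget an entry that the user deleted explicitly."""
        if entry_id in self._ids:
            self._ids.remove(entry_id)
            return True
        return False


class Labels:
    def __init__(self, max_size):
        self.labels = set()
        self.unlearner = RandomUnlearner(max_size)

    def train(self, label):
        evicted = self.unlearner.touch(label)
        if evicted is not None:
            self.labels.discard(evicted)
        self.labels.add(label)

    def delete_label(self, label):
        self.labels.discard(label)
        # Without this call the unlearner still counts the deleted label.
        self.unlearner.remove(label)


model = Labels(max_size=2)
model.train("label_1")
model.train("label_2")
model.delete_label("label_2")
model.train("label_3")
print(sorted(model.labels))
```

Because the deleted label frees a slot in the unlearner, training label_3 does not trigger an eviction and label_1 survives.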

weight_manager is not saved in NN and clustering

Currently we are not packing weight_manager in NN and clustering.

We can easily fix this, however this breaks the model file compatibility.

jubabandit: trial_count should be incremented when arm selection

Several bandit algorithms currently increment trial_count when reward is registered.
https://github.com/jubatus/jubatus_core/blob/master/jubatus/core/bandit/summation_storage.cpp#L67

trial_count should be incremented when an arm is selected.
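The proposed bookkeeping can be illustrated with a toy storage class in Python. The class and method names below are hypothetical and only show where the counter would be incremented; they do not mirror summation_storage.

```python
class ArmStats:
    def __init__(self):
        self.trial_count = 0
        self.total_reward = 0.0


class Storage:
    def __init__(self, arm_ids):
        self.arms = {arm_id: ArmStats() for arm_id in arm_ids}

    def notify_selected(self, arm_id):
        """Count the trial at selection time, whether or not a reward follows."""
        self.arms[arm_id].trial_count += 1

    def register_reward(self, arm_id, reward):
        """Only accumulate the reward; the trial was counted on selection."""
        self.arms[arm_id].total_reward += reward

    def expectation(self, arm_id):
        stats = self.arms[arm_id]
        if stats.trial_count == 0:
            return 0.0
        return stats.total_reward / stats.trial_count


storage = Storage(["a", "b"])
storage.notify_selected("a")
storage.notify_selected("a")
storage.register_reward("a", 1.0)   # the second selection produced no reward
print(storage.arms["a"].trial_count, storage.expectation("a"))
```

Counting at selection time means a selection that never receives a reward still lowers the arm's estimated expectation.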

Drivers should be thread safe.

To handle concurrent access to machine learning models, the Jubatus server holds a giant lock for each model.
However, a giant lock hurts performance on multi-core processors.

Drivers should be thread safe, and server should not hold giant lock for each model.

improper error message in gaussian_normalization_filter

https://github.com/jubatus/jubatus_core/blob/0.0.6/jubatus/core/fv_converter/num_filter_impl.hpp#L80

"Variance must be non-negative" should instead be "standard deviation must be non-negative".

Bandit: "clear" API should clear arm_ids_

https://github.com/jubatus/jubatus_core/blob/0.1.1/jubatus/core/bandit/summation_storage.cpp#L157

weight_manager cannot be MIXed when push_mixer is used

When using push_mixer MIX strategy, weight_manager is not MIXed between servers.
As a result, IDF values may become more inaccurate than when using linear_mixer.

"retain_projection" parameter in recommender euclid_lsh seems ineffective

Even when retain_projection is set to true, it dynamically calculates projection vectors:

https://github.com/jubatus/jubatus_core/blob/0.0.7/jubatus/core/recommender/euclid_lsh.cpp#L60-L76

The following seems to be dead code; it still compiles successfully after removing these lines:

https://github.com/jubatus/jubatus_core/blob/0.0.7/jubatus/core/recommender/euclid_lsh.cpp#L230-L257

unused code in anomaly light_lof

The variable table seems unused. We should remove it.

https://github.com/jubatus/jubatus_core/blob/0.0.7/jubatus/core/anomaly/light_lof.cpp#L99-L102

clustering: kmeans SegV when get_nearest_center is called before clustering is performed

jubaclustering crashes with SegV when get_nearest_center is called before clustering is performed.
This only happens if the "kmeans" method is specified.

Related to: #95 (Fix for "gmm" SegV issue)

Support unlearning for `inverted_index`

inverted_index is stable and easy to use. But, it doesn't support unlearning. Users who need unlearning cannot choose inverted_index.

I have two ideas to achieve it:

implement inverted_index nearest neighbor module, and implement unlearning
implement unlearning in inverted_index recommender

The former is preferable.

column table does not support squashing rows on Push MIX

Column table does not support propagating removed rows via push_mixer.

parameter validation for burst detection

Input parameters must be validated.
Valid ranges are documented in http://jubat.us/en/api_burst.html.

`lof` does not support unlearning

lof doesn't support unlearning, but lof_config contains its configuration. Remove it.
https://github.com/jubatus/jubatus_core/blob/master/jubatus/core/anomaly/anomaly_factory.cpp#L44

Nearest Neighbor does not work with idf

Migrating issue jubatus/jubatus#869.

remove `BURST_DEBUG`

We shouldn't use this compile option.
Instead, we should use NDEBUG flag for debugging.

nearest_neighbor does not call can_touch before touch

Note: As nearest_neighbor itself does not support unlearner so far, this issue does not affect the functionality of current Jubatus. Only people who plan to use the driver layer of the nearest_neighbor engine directly with unlearner enabled will be affected.

In nearest_neighbor, touch is called without calling can_touch.

https://github.com/jubatus/jubatus_core/blob/0.1.0/jubatus/core/driver/nearest_neighbor.cpp#L61

As touch operation may fail (and in such case the model must not be updated), we should check if unlearner accepts the given ID, before actually updating the model.
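The check-then-update order described above can be sketched as follows in Python. BoundedUnlearner is a stand-in policy invented for this example (it simply refuses new ids once full), and set_row is a hypothetical driver-level function.

```python
class BoundedUnlearner:
    """Toy unlearner that refuses new ids once it is full."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._ids = set()

    def can_touch(self, row_id):
        return row_id in self._ids or len(self._ids) < self.max_size

    def touch(self, row_id):
        if not self.can_touch(row_id):
            return False
        self._ids.add(row_id)
        return True


def set_row(model, unlearner, row_id, value):
    """Update the model only if the unlearner accepts the id."""
    if not unlearner.can_touch(row_id):
        raise RuntimeError("cannot add row %r: unlearner rejected the id" % row_id)
    model[row_id] = value
    unlearner.touch(row_id)


model = {}
unlearner = BoundedUnlearner(max_size=1)
set_row(model, unlearner, "row1", [1.0])
try:
    set_row(model, unlearner, "row2", [2.0])
except RuntimeError as error:
    print(error)
print(sorted(model))  # the rejected row was never written
```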

We'll implement bool is_commutative() API for each combination function. When is_commutative() returns true, full matrix of feature combination will be generated for the combination feature.
