chyikwei / recommend

recommendation system with python

Python 100.00%

python matrix-factorization recommendation-system

recommend's Introduction

Recommend

Simple recommendation system implementation with Python

Current model:

Probabilistic Matrix Factorization
Bayesian Matrix Factorization
Alternating Least Squares with Weighted Lambda Regularization (ALS-WR)

Reference:

"Probabilistic Matrix Factorization", R. Salakhutdinov and A. Mnih, NIPS 2008
"Bayesian Probabilistic Matrix Factorization using MCMC", R. Salakhutdinov and A. Mnih, ICML 2008
Matlab code: http://www.cs.toronto.edu/~rsalakhu/BPMF.html
"Large-scale Parallel Collaborative Filtering for the Netflix Prize", Y. Zhou, D. Wilkinson, R. Schreiber and R. Pan, 2008

Install:

# clone repository
git clone git@github.com:chyikwei/recommend.git
cd recommend

# install numpy & scipy
pip install -r requirements.txt
pip install .

Getting started:

A Jupyter notebook that compares the PMF and BPMF models can be found here.
To run BPMF on the MovieLens 1M dataset, first download the dataset and unzip it (the data will be in the ml-1m folder). Then run:

>>> import numpy as np
>>> from recommend.bpmf import BPMF
>>> from recommend.utils.evaluation import RMSE
>>> from recommend.utils.datasets import load_movielens_1m_ratings

# load user ratings
>>> ratings = load_movielens_1m_ratings('ml-1m/ratings.dat')
>>> n_user = max(ratings[:, 0])
>>> n_item = max(ratings[:, 1])
>>> ratings[:, (0, 1)] -= 1 # shift ids by 1 to let user_id & movie_id start from 0

# fit model
>>> bpmf = BPMF(n_user=n_user, n_item=n_item, n_feature=10,
                max_rating=5., min_rating=1., seed=0).fit(ratings, n_iters=20)
>>> RMSE(bpmf.predict(ratings[:, :2]), ratings[:,2]) # training RMSE
0.79784331768263683

# predict ratings for user 0 and item 0 to 9:
>>> bpmf.predict(np.array([[0, i] for i in range(10)]))
array([ 4.35574067,  3.60580936,  3.77778456,  3.4479072 ,  3.60901065,
        4.29750917,  3.66302187,  4.43915423,  3.85788772,  4.02423073])

Complete examples can be found in the examples/ folder. The scripts download the MovieLens 1M dataset automatically, run the PMF (or BPMF) model, and show the training and validation RMSE.
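As a rough illustration of what a training/validation RMSE measures, the sketch below splits a ratings array in (user_id, item_id, rating) format and scores a constant-mean baseline. The helpers `split_ratings` and `rmse` are hypothetical names for this sketch and are not the library's own functions; the toy data is randomly generated.

```python
import numpy as np


def split_ratings(ratings, train_fraction=0.9, seed=0):
    """Shuffle the rows and split them into train and validation parts."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(ratings.shape[0])
    n_train = int(train_fraction * ratings.shape[0])
    return ratings[order[:n_train]], ratings[order[n_train:]]


def rmse(predictions, targets):
    """Root mean squared error between two 1-D arrays."""
    return float(np.sqrt(np.mean((predictions - targets) ** 2)))


# Toy ratings: 200 rows of (user_id, item_id, rating), ratings in 1..5.
rng = np.random.RandomState(0)
ratings = np.column_stack([
    rng.randint(0, 50, size=200),
    rng.randint(0, 30, size=200),
    rng.randint(1, 6, size=200),
])

train, validation = split_ratings(ratings)

# Baseline: predict the mean training rating for every validation row.
baseline = np.full(validation.shape[0], train[:, 2].mean())
validation_rmse = rmse(baseline, validation[:, 2])
```

A fitted model's validation RMSE can be compared against this kind of baseline to see how much the latent features actually help.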

Running Test:

python setup.py test

or run test with coverage:

coverage run --source=recommend setup.py test
coverage report -m

Uninstall:

pip uninstall recommend

Notes:

The old version of the code can be found in v0.0.1. It contains a Theano implementation of Probabilistic Matrix Factorization.
The previous version (0.2.1) did not implement MCMC sampling correctly in the BPMF algorithm. At every step it computed predictions from the current value of the feature matrices alone and used those predictions to estimate the RMSE. This is not meaningful from an MCMC point of view: the purpose of the chain is to sample the feature matrices from the correct distributions in order to estimate the integral that defines the rating distribution. The correct approach (see Eq. 10 in the BPMF paper, the second reference above) averages the predictions over all steps to obtain the final prediction, and computes the RMSE from that average. In other words, the predicted value depends on the whole chain, not only on the last sampled feature matrices. With this fix, the BPMF RMSE improves on both the training and test sets (you can see it in this notebook). (Thanks to LoryPack for the contribution!)
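The difference between predicting from the last sample and averaging over the chain can be sketched with NumPy alone. This is a simplified illustration, not the library's implementation: the "samples" below are synthetic noisy copies of known feature matrices standing in for Gibbs draws, and `predict_pairs`, `averaged_predictions` and `rmse` are hypothetical helper names.

```python
import numpy as np


def predict_pairs(user_features, item_features, pairs):
    """Dot product of user and item features for each (user, item) pair."""
    return np.sum(user_features[pairs[:, 0]] * item_features[pairs[:, 1]], axis=1)


def averaged_predictions(samples, pairs):
    """Average the predictions over every sampled pair of feature matrices."""
    total = np.zeros(pairs.shape[0])
    for user_features, item_features in samples:
        total += predict_pairs(user_features, item_features, pairs)
    return total / len(samples)


def rmse(predictions, targets):
    return float(np.sqrt(np.mean((predictions - targets) ** 2)))


rng = np.random.RandomState(0)
true_users = rng.randn(30, 5)
true_items = rng.randn(20, 5)
pairs = np.column_stack([rng.randint(0, 30, 100), rng.randint(0, 20, 100)])
targets = predict_pairs(true_users, true_items, pairs)

# Synthetic stand-ins for MCMC draws: the true features plus independent noise.
samples = [
    (true_users + 0.3 * rng.randn(30, 5), true_items + 0.3 * rng.randn(20, 5))
    for _ in range(200)
]

last_users, last_items = samples[-1]
last_sample_rmse = rmse(predict_pairs(last_users, last_items, pairs), targets)
averaged_rmse = rmse(averaged_predictions(samples, pairs), targets)
```

In this toy setup the averaged prediction is much closer to the target than the prediction from the last draw, which is the effect the note above describes.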

recommend's Issues

Multiplication between item_features and user_features

Hi,
I've a question:
Why, when I multiply item_features and user_features, don't I obtain the initial matrix that I gave as input to the PMF (or BPMF) model? It is a matrix factorization technique, isn't it? If so, multiplying those factors should give back the initial matrix.
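One way to see why the product is only an approximation: with fewer latent features than the rank of the matrix, no pair of factors can reproduce it exactly. The sketch below uses a truncated SVD as a stand-in for PMF (an assumption made for illustration; PMF additionally fits only the observed entries and regularizes the factors), and `rank_k_approximation` is a hypothetical helper.

```python
import numpy as np


def rank_k_approximation(matrix, k):
    """Best rank-k approximation, written as user_features @ item_features.T."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    user_features = u[:, :k] * s[:k]   # shape (n_user, k)
    item_features = vt[:k].T           # shape (n_item, k)
    return user_features @ item_features.T


rng = np.random.RandomState(0)
full_matrix = rng.randint(1, 6, size=(8, 6)).astype(float)

# With 2 features the product only approximates the matrix ...
error_rank_2 = np.linalg.norm(full_matrix - rank_k_approximation(full_matrix, 2))
# ... while using as many features as the matrix has columns reproduces it.
error_rank_6 = np.linalg.norm(full_matrix - rank_k_approximation(full_matrix, 6))
```

The approximation error shrinks as the number of features grows, but a small n_feature is usually the point: the factors are meant to generalize to unobserved entries rather than memorize the input.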

Problems with PMF

Hi,
My professor and I have looked closely at this library in order to apply PMF to a ratings matrix, and we have encountered a problem with the fit() method of PMF.
Below, 'ratings_matrix' denotes the ratings matrix that we give as input to the library.

ratings_matrix = [[ 9.00000000e+00 2.71600000e+03 4.00000000e+00]
[ 3.84100000e+03 3.50300000e+03 5.00000000e+00]
[ 2.89900000e+03 1.24600000e+03 5.00000000e+00]
...,
[ 2.63700000e+03 1.26700000e+03 3.00000000e+00]
[ 2.83000000e+02 2.42000000e+03 1.00000000e+00]
[ 4.55100000e+03 1.64000000e+02 3.00000000e+00]]

When we call the fit() method, after pmf = PMF(n_user = number_of_users, n_item = number_of_items, n_feature = number_of_features, min_rating = min_rating_value, max_rating = max), with NumPy version 1.13.3, we encounter the following problem:
Traceback (most recent call last):
File "/Users/edoardo/PycharmProjects/MasterThesisProject/Thesis.py", line 529, in <module>
final_ratings_matrix_pmf = probabilistic_matrix_factorization_technique(ratings_matrix)
File "/Users/edoardo/PycharmProjects/MasterThesisProject/Thesis.py", line 250, in probabilistic_matrix_factorization_technique
pmf = PMF(n_user = number_of_users, n_item = number_of_items, n_feature = number_of_features, min_rating = min_rating_value, max_rating = max_rating_value, seed = initial_seed)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/recommend/pmf.py", line 56, in __init__
self.user_features_ = 0.1 * self.random_state.rand(n_user, n_feature)
File "mtrand.pyx", line 1358, in mtrand.RandomState.rand
File "mtrand.pyx", line 856, in mtrand.RandomState.random_sample
File "mtrand.pyx", line 167, in mtrand.cont0_array
TypeError: 'numpy.float64' object cannot be interpreted as an index

When we call the fit() method with NumPy version 1.11.0, we encounter a different problem:
/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/recommend/pmf.py:56: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
self.user_features_ = 0.1 * self.random_state.rand(n_user, n_feature)
/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/recommend/pmf.py:57: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
self.item_features_ = 0.1 * self.random_state.rand(n_item, n_feature)
PMF training and testing phases
/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/recommend/pmf.py:70: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
u_feature_mom = np.zeros((self.n_user, self.n_feature))
/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/recommend/pmf.py:71: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
i_feature_mom = np.zeros((self.n_item, self.n_feature))
/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/recommend/pmf.py:73: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
u_feature_grads = np.zeros((self.n_user, self.n_feature))
/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/recommend/pmf.py:74: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
i_feature_grads = np.zeros((self.n_item, self.n_feature))
Traceback (most recent call last):
File "/Users/edoardo/PycharmProjects/MasterThesisProject/Thesis.py", line 529, in <module>
final_ratings_matrix_pmf = probabilistic_matrix_factorization_technique(ratings_matrix)
File "/Users/edoardo/PycharmProjects/MasterThesisProject/Thesis.py", line 254, in probabilistic_matrix_factorization_technique
pmf.fit(training_set, n_iters = evaluation_iterations)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/recommend/pmf.py", line 87, in fit
data.take(0, axis=1), axis=0)
TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'

Have you ever encountered this type of problem? We don't know how to solve this problem!

Any help can be useful! Thanks a lot!
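Both tracebacks point at floating-point values being used where NumPy expects integers (array shapes and index arrays). A minimal sketch of one possible workaround, assuming the ids are 0-based and the ratings are whole numbers so that the cast loses nothing, is to convert the array to an integer dtype and derive the counts as Python ints before building the model:

```python
import numpy as np

# A few rows in the same float format as the ratings_matrix shown above.
float_ratings = np.array([
    [9.0, 2716.0, 4.0],
    [3841.0, 3503.0, 5.0],
    [2899.0, 1246.0, 5.0],
])

# Cast to integers; only safe here because every value is a whole number.
ratings = float_ratings.astype(np.int64)

# Counts as plain Python ints (assumes 0-based ids, hence the + 1).
n_user = int(ratings[:, 0].max()) + 1
n_item = int(ratings[:, 1].max()) + 1

# Integer sizes and integer index arrays now work with rand() and take().
user_features = 0.1 * np.random.RandomState(0).rand(n_user, 3)
selected = user_features.take(ratings[:, 0], axis=0)
```

If the ratings themselves are fractional, this whole-array cast would truncate them, so it is only a sketch for data like the sample shown in the issue.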

can 'numpy.take()' be replaced by “fancy” indexing (indexing arrays using arrays)?

Why use numpy.take() rather than “fancy” indexing (indexing arrays using arrays), for example in pmf.py, when “fancy” indexing is much faster?
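For row selection the two forms return the same result, so they are interchangeable as far as correctness goes. The sketch below only checks that equivalence on a toy array; which form is faster depends on the NumPy version and array shapes, so it is worth timing both (for example with timeit) before changing anything.

```python
import numpy as np

features = np.arange(12.0).reshape(4, 3)   # toy feature matrix, 4 rows
user_ids = np.array([2, 0, 2, 3])          # integer index array

taken = features.take(user_ids, axis=0)    # numpy.take along rows
indexed = features[user_ids]               # "fancy" indexing with an array
```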

A small mistake in the algorithm

Though the program runs without errors, there is a bug in the algorithm. The corrected lines are given below. Line 123 in https://github.com/chyikwei/recommend/blob/master/mf/bayesian_matrix_factorization.py

 WI_post = inv(inv(self.WI_item) + N * S_bar + \
            np.dot(norm_X_bar, norm_X_bar.T) * \
            (N * self.beta_item) / (self.beta_item + N))

Line 157 in https://github.com/chyikwei/recommend/blob/master/mf/bayesian_matrix_factorization.py

 WI_post = inv(inv(self.WI_user) + N * S_bar + \
            np.dot(norm_X_bar, norm_X_bar.T) * \
            (N * self.beta_user) / (self.beta_user + N))
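For comparison, the Wishart scale update in the BPMF paper (the second reference in the README) has the form below. The mapping to the code is an assumption read off the variable names: WI_item or WI_user for W_0, beta_item or beta_user for \beta_0, S_bar for \bar{S}, and norm_X_bar for the difference \mu_0 - \bar{U}.

```latex
\left(W_0^{*}\right)^{-1}
  = W_0^{-1} + N\,\bar{S}
  + \frac{\beta_0 N}{\beta_0 + N}\,
    \left(\mu_0 - \bar{U}\right)\left(\mu_0 - \bar{U}\right)^{T}
```

Under that mapping, the corrected lines compute W_0^{*} by inverting the right-hand side, with the factor N multiplying \bar{S} and the factor \beta_0 N / (\beta_0 + N) multiplying the outer product.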

Running bmf with my data

I have a 31x9 matrix and I want to run BMF on it with your code. First, I read the matrix in sparse format (180x3), as in your example. Then I compute the max of the first and second columns and try to run your code:

print n_user 31
print n_item 9
print n_feat 15
print ratings #numpy np.array

[[ 1  1 11]
 [ 1  5  7]
 [ 1  6 12]
...
 [31  5  7]
 [31  6  9]
 [31  8  9]]

#fit model
bpmf = BPMF(n_user=n_user, n_item=n_item, n_feature=n_feat,
                max_rating=15., min_rating=0., seed=0).fit(ratings, n_iters=20)
print RMSE(bpmf.predict(ratings[:, :2]), ratings[:,2]) # training RMSE

And I am receiving the following message: raise ValueError("max user_id >= %d", n_user)
ValueError: ('max user_id >= %d', 31)
What am I doing wrong? It does work if I set n_user = 32 and n_item = 10, but does that make any sense? Furthermore, the results of bpmf.predict(ratings) are just approximations of the values in my initial results. What about the rest of the values?
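The README example shifts 1-based ids down by one before fitting, and the error above is consistent with that step being skipped: with ids 1..31 the largest user id equals n_user. Setting n_user = 32 also passes the check, but it leaves an unused row for id 0. A sketch of the shift, plus one way to build every (user, item) pair so that predictions cover unobserved cells too, using a few rows from the issue:

```python
import numpy as np

# A few rows in the issue's format: (user_id, item_id, rating), ids from 1.
ratings = np.array([
    [1, 1, 11],
    [1, 5, 7],
    [1, 6, 12],
    [31, 5, 7],
    [31, 6, 9],
    [31, 8, 9],
])

n_user = int(ratings[:, 0].max())   # 31 users when ids run from 1 to 31
n_item = 9                          # item count stated in the issue

# Shift ids so that they start from 0, as in the README example.
ratings[:, :2] -= 1

# Every (user, item) pair, including the ones with no observed rating.
users, items = np.meshgrid(np.arange(n_user), np.arange(n_item), indexing="ij")
all_pairs = np.column_stack([users.ravel(), items.ravel()])
```

Passing an array like all_pairs to predict(), in the same way the README passes a list of (0, i) pairs, should then return a prediction for every cell of the 31x9 matrix; reshaping the result to (n_user, n_item) gives the filled-in matrix.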

Parameter 'Lambda' in the ALS-WR Model

Hi,
I've seen the implementation of the ALS-WR model provided in this library.
This model has got a parameter called 'lambda'. I've a question: can I set the 'lambda' parameter of the model and how? The 'lambda' parameter in the reference paper is the 'reg' parameter in this implementation, right? If wrong, how can I set the 'lambda' parameter?

Thanks for the answer!
Best regards.
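For reference, the objective minimized in the ALS-WR paper (Zhou et al., 2008) shows where lambda enters. Here I is the set of observed (user, item) pairs, u_i and m_j are the user and item feature vectors, and n_{u_i} and n_{m_j} are the number of ratings of user i and of item j. Whether the 'reg' argument corresponds exactly to this lambda is not confirmed in this thread.

```latex
f(U, M) = \sum_{(i,j) \in I} \left( r_{ij} - u_i^{T} m_j \right)^{2}
  + \lambda \left( \sum_{i} n_{u_i} \lVert u_i \rVert^{2}
                 + \sum_{j} n_{m_j} \lVert m_j \rVert^{2} \right)
```

The weighting by n_{u_i} and n_{m_j} is what distinguishes the "weighted lambda" regularization from a plain L2 penalty.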

add "burn-in" parameter in bpmf code

based on discussion in pr #24

RuntimeWarning: overflow encountered in multiply

With eval_iters = 50, I encountered a problem (RuntimeWarning: overflow encountered in multiply). As eval_iters increases, I only want to obtain the minimum RMSE, but I don't know how to solve the problem.
Here is the output:
recommend-0.1.0-py2.7.egg/recommend/pmf.py:86: RuntimeWarning: overflow encountered in multiply
recommend-0.1.0-py2.7.egg/recommend/pmf.py:88: RuntimeWarning: overflow encountered in multiply
recommend-0.1.0-py2.7.egg/recommend/pmf.py:89: RuntimeWarning: overflow encountered in multiply
recommend-0.1.0-py2.7.egg/recommend/pmf.py:90: RuntimeWarning: overflow encountered in multiply
recommend-0.1.0-py2.7.egg/recommend/pmf.py:97: RuntimeWarning: invalid value encountered in add
recommend-0.1.0-py2.7.egg/recommend/pmf.py:104: RuntimeWarning: invalid value encountered in add
recommend-0.1.0-py2.7.egg/recommend/pmf.py:133: RuntimeWarning: invalid value encountered in greater
site-packages/recommend-0.1.0-py2.7.egg/recommend/pmf.py:136: RuntimeWarning: invalid value encountered in less
INFO: iter: 24, train RMSE: nan
INFO: iter: 25, train RMSE: nan
INFO: iter: 26, train RMSE: nan
INFO: iter: 27, train RMSE: nan ...
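Overflow warnings followed by a nan RMSE are the typical signature of gradient updates that diverge. One possible cause, not confirmed in this thread, is a step size that is too large for the data. The sketch below is a plain, unregularized full-batch gradient descent on a toy matrix (not the library's update rule; `train_rmse` is a hypothetical helper) that shows the same symptom with a large learning rate and normal behaviour with a small one.

```python
import numpy as np


def train_rmse(learning_rate, n_iters=500, seed=0):
    """Full-batch gradient descent on squared error; returns the final RMSE."""
    rng = np.random.RandomState(seed)
    target = rng.randint(1, 6, size=(20, 15)).astype(float)
    user_features = 0.1 * rng.rand(20, 3)
    item_features = 0.1 * rng.rand(15, 3)
    # Silence the overflow/invalid warnings so divergence shows up as inf/nan.
    with np.errstate(over="ignore", invalid="ignore"):
        for _ in range(n_iters):
            error = user_features @ item_features.T - target
            user_grad = error @ item_features
            item_grad = error.T @ user_features
            user_features -= learning_rate * user_grad
            item_features -= learning_rate * item_grad
        error = user_features @ item_features.T - target
        return float(np.sqrt(np.mean(error ** 2)))


stable_rmse = train_rmse(learning_rate=0.001)    # converges to a finite RMSE
diverged_rmse = train_rmse(learning_rate=10.0)   # blows up to inf or nan
```

If the goal is only the minimum RMSE, it may also help to record the RMSE at each iteration and keep the best value seen before the updates diverge.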

BMF and PMF parameters

Hi there, I am trying to understand the parameters of the BMF and PMF algorithms that you use to tune the optimization objective. Could you please elaborate a bit on which parameters need tuning?

Problem with PMF [fit() method]

Hi,
I'm a computer science student in Milan. My goal is to use this library in order to perform PMF on my ratings matrix.
The problem I encountered is the following:
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/recommend/pmf.py", line 62, in fit
self.max_rating, self.min_rating)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/recommend/utils/validation.py", line 22, in check_ratings
raise ValueError("max user_id >= %d", n_user)
ValueError: ('max user_id >= %d', 4)

My ratings matrix is like your ratings matrix in your examples (so, each user and item id start from 0, not from 1, so I don't have to set -1 for the id). One example of my ratings matrix is the following:

"[[0 0 1]
[0 1 2]
[0 2 1]
[0 3 1]
[0 4 0]
[1 0 1]
[1 1 0]
[1 2 4]
[1 3 4]
[1 4 4]
[2 0 1]
[2 1 1]
[2 2 0]
[2 3 1]
[2 4 1]
[3 0 5]
[3 1 4]
[3 2 5]
[3 3 5]
[3 4 0]
[4 0 3]
[4 1 1]
[4 2 0]
[4 3 3]
[4 4 3]]"

As you can see, max user id = 4 and also max item id = 4 (and both start from 0, not from 1).

When I call the fit() method, my program stops with that error.
I don't know how to solve my problem.
Can you help me?

Thanks.
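The error message says that the largest user id must be strictly smaller than n_user. With 0-based ids running from 0 to 4 there are five users, so passing n_user = 4 would trigger it. The check below is a hypothetical re-creation written for this sketch, modelled on the message in the traceback rather than copied from the library.

```python
import numpy as np


def check_ids(ratings, n_user, n_item):
    """Hypothetical id range check: every id must be smaller than its count."""
    max_user_id = int(ratings[:, 0].max())
    max_item_id = int(ratings[:, 1].max())
    if max_user_id >= n_user:
        raise ValueError("max user_id >= %d" % n_user)
    if max_item_id >= n_item:
        raise ValueError("max item_id >= %d" % n_item)


# Same shape as the matrix above: users 0..4 and items 0..4.
ratings = np.array([[u, i, 1] for u in range(5) for i in range(5)])

# With 0-based ids, the counts are the largest id plus one.
n_user = int(ratings[:, 0].max()) + 1
n_item = int(ratings[:, 1].max()) + 1
check_ids(ratings, n_user, n_item)   # passes with n_user = n_item = 5
```

The README example computes n_user as the max id before shifting 1-based ids down, which gives the same count; with ids that already start from 0, the + 1 is needed instead.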

try reproduce BPMF_MCMC_correct_wrong_comparison notebook

Try to reproduce this notebook from scratch:
https://github.com/chyikwei/recommend/blob/master/examples/BPMF_MCMC_correct_wrong_comparison.ipynb

Predict Users and Items Latent Features with PMF or BPMF Models

Hi,
The pipeline I’ve implemented is the following:

train a PMF model with a ratings matrix as input, obtaining latent features for both items and users;
predict the rating values for a specific user whose identifier is in the input ratings matrix, by using the latent features extracted with the previous step.

Now I would like to use the trained PMF model to extract latent features (in order to compute predicted rating values) for new users whose identifiers are not in the input ratings matrix, i.e. to predict rating values and obtain latent features for users who were not present during PMF training. Is this possible? How? Please provide a code example, if possible.

Is it possible to do the same with the BPMF model?

Thanks for the answers,
Best regards.
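One common technique for this, sometimes called "fold-in", keeps the trained item features fixed and solves a small regularized least-squares problem for the new user's feature vector from the few ratings that user has given. The sketch below is not a library feature: `fold_in_user` and its `reg` argument are hypothetical, and the item features are random stand-ins for a trained model's item feature matrix (the tracebacks elsewhere on this page suggest PMF stores it as item_features_).

```python
import numpy as np


def fold_in_user(item_features, rated_item_ids, user_ratings, reg=0.1):
    """Ridge-regression user vector for a new user, item features held fixed."""
    rated = item_features[rated_item_ids]            # (n_rated, n_feature)
    n_feature = item_features.shape[1]
    gram = rated.T @ rated + reg * np.eye(n_feature)
    return np.linalg.solve(gram, rated.T @ user_ratings)


rng = np.random.RandomState(0)
item_features = rng.randn(40, 4)                     # stand-in for trained features
true_user = np.array([0.5, -1.0, 0.3, 0.8])          # used only to build toy ratings
rated_item_ids = np.arange(0, 40, 2)                 # the new user rated 20 items
user_ratings = item_features[rated_item_ids] @ true_user

new_user = fold_in_user(item_features, rated_item_ids, user_ratings, reg=1e-6)
predicted = item_features @ new_user                 # scores for all 40 items
```

If the model subtracts a mean rating during training, the same mean should be subtracted from user_ratings before solving and added back to the predictions. For BPMF, the same solve could be repeated for each sampled item feature matrix and the predictions averaged, which is again an assumption rather than something the library provides.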