
Factorization Machines for Recommendation and Ranking Problems with Implicit Feedback Data

License: GNU General Public License v3.0

Python 86.95% Makefile 0.13% C 12.92%
machine-learning recommendation factorization-machines recommender-system learning-to-rank implicit-feedback collaborative-filtering

rankfm's Introduction

RankFM


RankFM is a python implementation of the general Factorization Machines model class adapted for collaborative filtering recommendation/ranking problems with implicit feedback user/item interaction data. It uses Bayesian Personalized Ranking (BPR) and a variant of Weighted Approximate-Rank Pairwise (WARP) loss to learn model weights via Stochastic Gradient Descent (SGD). It can (optionally) incorporate sample weights and user/item auxiliary features to augment the main interaction data.

The core (training, prediction, recommendation) methods are written in Cython, making it possible to scale to millions of user/item interactions. Designed for ease-of-use, RankFM accepts both pd.DataFrame and np.ndarray inputs - you do not have to convert your data to scipy.sparse matrices or re-map user/item identifiers prior to use. RankFM internally maps all user/item identifiers to zero-based integer indexes, but always converts its output back to the original user/item identifiers from your data, which can be arbitrary (non-zero-based, non-consecutive) integers or even strings.
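The identifier mapping described above can be pictured with a small pure-Python sketch. It only illustrates the idea and is not RankFM's internal code; the names `build_index` and `raw_ids` are made up for this example.

```python
def build_index(raw_ids):
    """Map arbitrary hashable identifiers to zero-based integer indexes."""
    unique_ids = sorted(set(raw_ids))
    id_to_index = {raw: idx for idx, raw in enumerate(unique_ids)}
    index_to_id = {idx: raw for raw, idx in id_to_index.items()}
    return id_to_index, index_to_id


# string identifiers that are neither zero-based nor consecutive
raw_ids = ["u-17", "u-3", "u-17", "u-250"]
id_to_index, index_to_id = build_index(raw_ids)

internal = [id_to_index[r] for r in raw_ids]   # what a model would train on
restored = [index_to_id[i] for i in internal]  # what a caller would get back
```

The round trip returns the original identifiers unchanged, which is why callers never need to see the internal indexes.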

In addition to the familiar fit(), predict(), recommend() methods, RankFM includes additional utilities similar_users() and similar_items() to find the most similar users/items to a given user/item based on latent factor space embeddings. A number of popular recommendation/ranking evaluation metric functions have been included in the separate evaluation module to streamline model tuning and validation.

  • see the Quickstart section below to get started with the basic functionality
  • see the /examples folder for more in-depth jupyter notebook walkthroughs with several popular open-source data sets
  • see the Online Documentation for more comprehensive documentation on the main model class and separate evaluation module
  • see the Medium Article for contextual motivation and a detailed mathematical description of the algorithm

Dependencies

  • Python 3.6+
  • numpy >= 1.15
  • pandas >= 0.24

Installation

Prerequisites

To install RankFM's C extensions you will need the GNU Compiler Collection (GCC). Check to see whether you already have it installed:

gcc --version

If you don't have it already you can easily install it using Homebrew on OSX or your default linux package manager:

# OSX
brew install gcc

# linux
sudo yum install gcc

# ensure [gcc] has been installed correctly and is on the system PATH
gcc --version

Package Installation

You can install the latest published version from PyPI using pip:

pip install rankfm

Or alternatively install the current development build directly from GitHub:

pip install git+https://github.com/etlundquist/rankfm.git#egg=rankfm

It's highly recommended that you use an Anaconda base environment to ensure that all core numpy C extensions and linear algebra libraries have been installed and configured correctly. Anaconda: it just works.

Quickstart

Let's work through a simple example of fitting a model, generating recommendations, evaluating performance, and assessing some item-item similarities. The data we'll be using here may already be somewhat familiar: you know it, you love it, it's the MovieLens 1M!

Let's first look at the required shape of the interaction data:

user_id item_id
3 233
5 377
8 610

It has just two columns: a user_id and an item_id (you can name these fields whatever you want or use a numpy array instead). Notice that there is no rating column - this library is for implicit feedback data (e.g. watches, page views, purchases, clicks) as opposed to explicit feedback data (e.g. 1-5 ratings, thumbs up/down). Implicit feedback is far more common in real-world recommendation contexts and doesn't suffer from the missing-not-at-random problem of pure explicit feedback approaches.

Now let's import the library, initialize our model, and fit on the training data:

from rankfm.rankfm import RankFM
model = RankFM(factors=20, loss='warp', max_samples=20, alpha=0.01, sigma=0.1, learning_rate=0.1, learning_schedule='invscaling')
model.fit(interactions_train, epochs=20, verbose=True)
# NOTE: this takes about 30 seconds for 750,000 interactions on my 2.3 GHz i5 8GB RAM MacBook

If you set verbose=True the model will print the current epoch number as well as the epoch's log-likelihood during training. This can be useful to gauge both computational speed and training gains by epoch. If the log likelihood is not increasing then try upping the learning_rate or lowering the (alpha, beta) regularization strength terms. If the log likelihood is starting to bounce up and down try lowering the learning_rate or using learning_schedule='invscaling' to decrease the learning rate over time. If you run into overflow errors then decrease the feature and/or sample-weight magnitudes and try upping beta, especially if you have a small number of dense user-features and/or item-features. Selecting BPR loss will lead to faster training times, but WARP loss typically yields superior model performance.

Now let's generate some user-item model scores from the validation data:

valid_scores = model.predict(interactions_valid, cold_start='nan')

This will produce an array of real-valued model scores generated using the Factorization Machines model equation. You can interpret it as a measure of the predicted utility of item (i) for user (u). The cold_start='nan' option can be used to set scores to np.nan for user/item pairs not found in the training data, or cold_start='drop' can be specified to drop those pairs so the results contain no missing values.

Now let's generate our topN recommended movies for each user:

valid_recs = model.recommend(valid_users, n_items=10, filter_previous=True, cold_start='drop')

The input should be a pd.Series, np.ndarray or list of user_id values. You can use filter_previous=True to prevent generating recommendations that include any items observed by the user in the training data, which could be useful depending on your application context. The result will be a pd.DataFrame where user_id values will be the index and the rows will be each user's top recommended items in descending order (best item is in column 0):

0 1 2 3 4 5 6 7 8 9
3 2396 1265 357 34 2858 3175 1 2028 17 356
5 608 1617 1610 3418 590 474 858 377 924 1036
8 589 1036 2571 2028 2000 1220 1197 110 780 1954

Now let's see how the model is performing wrt the included validation metrics evaluated on the hold-out data:

from rankfm.evaluation import hit_rate, reciprocal_rank, discounted_cumulative_gain, precision, recall

valid_hit_rate = hit_rate(model, interactions_valid, k=10)
valid_reciprocal_rank = reciprocal_rank(model, interactions_valid, k=10)
valid_dcg = discounted_cumulative_gain(model, interactions_valid, k=10)
valid_precision = precision(model, interactions_valid, k=10)
valid_recall = recall(model, interactions_valid, k=10)

hit_rate: 0.796
reciprocal_rank: 0.339
dcg: 0.734
precision: 0.159
recall: 0.077

That's a Bingo!
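To make the metric names above concrete, here is a minimal sketch of the textbook top-k definitions for a single user. It is an illustration under stated assumptions, not the code in `rankfm.evaluation`: the library's functions take the model and the validation interactions and aggregate over users, and their exact conventions (for example the denominators) may differ. The function name `metrics_at_k` and the sample item ids are hypothetical.

```python
def metrics_at_k(recommended, relevant, k=10):
    """Hit rate, precision, recall and reciprocal rank for one user's top-k list."""
    top_k = list(recommended)[:k]
    relevant = set(relevant)
    hits = [item for item in top_k if item in relevant]

    # reciprocal rank of the first relevant item (0.0 if none appears in the top k)
    reciprocal_rank = 0.0
    for rank, item in enumerate(top_k, start=1):
        if item in relevant:
            reciprocal_rank = 1.0 / rank
            break

    return {
        "hit_rate": 1.0 if hits else 0.0,
        "precision": len(hits) / k,
        "recall": len(hits) / len(relevant) if relevant else 0.0,
        "reciprocal_rank": reciprocal_rank,
    }


# hypothetical example: 5 recommendations, 3 held-out items for this user
result = metrics_at_k([2396, 1265, 357, 34, 2858], {357, 2858, 999}, k=5)
```

Averaging each value over all validation users gives a single number per metric, which is the form the figures above are reported in.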

Now let's find the most similar other movies for a few movies based on their embedding representations in latent factor space:

# Terminator 2: Judgment Day (1991)
model.similar_items(589, n_items=10)
2571                       Matrix, The (1999)
1527                Fifth Element, The (1997)
2916                      Total Recall (1990)
3527                          Predator (1987)
780             Independence Day (ID4) (1996)
1909    X-Files: Fight the Future, The (1998)
733                          Rock, The (1996)
1376     Star Trek IV: The Voyage Home (1986)
480                      Jurassic Park (1993)
1200                            Aliens (1986)

I hope you like explosions...

# Being John Malkovich (1999)
model.similar_items(2997, n_items=10)
2599           Election (1999)
3174    Man on the Moon (1999)
2858    American Beauty (1999)
3317        Wonder Boys (2000)
223              Clerks (1994)
3897      Almost Famous (2000)
2395           Rushmore (1998)
2502       Office Space (1999)
2908     Boys Don't Cry (1999)
3481      High Fidelity (2000)

Let's get weird...


rankfm's Issues

KeyError: 'the items in [item_features] do not match the items in [interactions]'

item_features_train = pd.get_dummies(train_interactions[['Items', 'moment']], columns=['moment'])

I am classifying my items into fast, medium, slow moving items.
for this I am using the parameter "item_features".

It gives me this error:
model.fit(train_user_item, user_features=None, item_features=item_features_train, sample_weight=sample_weight_train, epochs=epochs, verbose=verbose)
  File "/usr/local/lib/python3.7/dist-packages/rankfm/rankfm.py", line 265, in fit
    self.fit_partial(interactions, user_features, item_features, sample_weight, epochs, verbose)
  File "/usr/local/lib/python3.7/dist-packages/rankfm/rankfm.py", line 289, in fit_partial
    self._init_all(interactions, user_features, item_features, sample_weight)
  File "/usr/local/lib/python3.7/dist-packages/rankfm/rankfm.py", line 135, in _init_all
    self._init_features(user_features, item_features)
  File "/usr/local/lib/python3.7/dist-packages/rankfm/rankfm.py", line 214, in _init_features
    raise KeyError('the items in [item_features] do not match the items in [interactions]')
KeyError: 'the items in [item_features] do not match the items in [interactions]'
can someone help me with this?
I have also gone through the example notebooks. I noticed that you constructed the item features but did not use them in the Instacart example. It would be helpful if those notebooks were updated.
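One possible cause, offered as an assumption because the report does not include the data: calling `pd.get_dummies` directly on the interaction log yields one feature row per interaction, so the same item appears many times, whereas the error message suggests the feature frame is expected to line up with the set of items in the interactions. The sketch below builds one row per unique item and checks the two sets against each other before fitting. The column names follow the report; the sample data and the variable names `item_attrs`, `missing`, `extra` are hypothetical.

```python
import pandas as pd

# hypothetical interaction log: the same item appears on several rows
train_interactions = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "Items": ["a", "b", "a", "c", "b"],
    "moment": ["fast", "slow", "fast", "medium", "slow"],
})

# keep one row per unique item (assumes 'moment' is constant per item),
# then one-hot encode the category; the item id stays in the first column
item_attrs = train_interactions[["Items", "moment"]].drop_duplicates(subset="Items")
item_features = pd.get_dummies(item_attrs, columns=["moment"], dtype=float)

# sanity checks to run before passing item_features to fit()
interaction_items = set(train_interactions["Items"])
feature_items = set(item_features["Items"])
missing = interaction_items - feature_items   # items with no feature row
extra = feature_items - interaction_items     # feature rows for unseen items
has_duplicates = item_features["Items"].duplicated().any()
```

If `missing`, `extra` and `has_duplicates` are all empty or false, the item ids in the two frames agree exactly.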

installation on windows 11 fails

sep13 N1\Fastapi multi replica>pip install rankfm
Collecting rankfm
Using cached rankfm-0.2.5.tar.gz (145 kB)
Preparing metadata (setup.py) ... done
Requirement already satisfied: numpy>=1.15 in c:\my_py_environments\py310_env_flaml_aug1_2023\lib\site-packages (from rankfm) (1.24.3)
Requirement already satisfied: pandas>=0.24 in c:\my_py_environments\py310_env_flaml_aug1_2023\lib\site-packages (from rankfm) (1.5.3)
Requirement already satisfied: pytz>=2020.1 in c:\my_py_environments\py310_env_flaml_aug1_2023\lib\site-packages (from pandas>=0.24->rankfm) (2023.3)
Requirement already satisfied: python-dateutil>=2.8.1 in c:\my_py_environments\py310_env_flaml_aug1_2023\lib\site-packages (from pandas>=0.24->rankfm) (2.8.2)
Requirement already satisfied: six>=1.5 in c:\my_py_environments\py310_env_flaml_aug1_2023\lib\site-packages (from python-dateutil>=2.8.1->pandas>=0.24->rankfm) (1.16.0)
Building wheels for collected packages: rankfm
Building wheel for rankfm (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [20 lines of output]
building extensions with pre-generated C source...
running bdist_wheel
running build
running build_py
creating build
creating build\lib.win-amd64-cpython-310
creating build\lib.win-amd64-cpython-310\rankfm
copying rankfm\evaluation.py -> build\lib.win-amd64-cpython-310\rankfm
copying rankfm\rankfm.py -> build\lib.win-amd64-cpython-310\rankfm
copying rankfm\utils.py -> build\lib.win-amd64-cpython-310\rankfm
copying rankfm\__init__.py -> build\lib.win-amd64-cpython-310\rankfm
running build_ext
building 'rankfm._rankfm' extension
creating build\temp.win-amd64-cpython-310
creating build\temp.win-amd64-cpython-310\Release
creating build\temp.win-amd64-cpython-310\Release\rankfm
creating build\temp.win-amd64-cpython-310\Release\rankfm\mt19937ar
"C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\bin\HostX86\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -IC:\my_py_environments\py310_env_flaml_aug1_2023\include -IC:\Users\cde3\AppData\Local\Programs\Python\Python310\include -IC:\Users\cde3\AppData\Local\Programs\Python\Python310\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\ATLMFC\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\VS\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\cppwinrt" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" /Tcrankfm/_rankfm.c /Fobuild\temp.win-amd64-cpython-310\Release\rankfm/_rankfm.obj -O2 -ffast-math -Wno-unused-function -Wno-uninitialized
cl : Command line error D8021 : invalid numeric argument '/Wno-unused-function'
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\bin\HostX86\x64\cl.exe' failed with exit code 2
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for rankfm
Running setup.py clean for rankfm
Failed to build rankfm
Installing collected packages: rankfm
Running setup.py install for rankfm ... error
error: subprocess-exited-with-error

× Running setup.py install for rankfm did not run successfully.
│ exit code: 1
╰─> [22 lines of output]
building extensions with pre-generated C source...
running install
C:\my_py_environments\py310_env_flaml_aug1_2023\lib\site-packages\setuptools\command\install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
warnings.warn(
running build
running build_py
creating build
creating build\lib.win-amd64-cpython-310
creating build\lib.win-amd64-cpython-310\rankfm
copying rankfm\evaluation.py -> build\lib.win-amd64-cpython-310\rankfm
copying rankfm\rankfm.py -> build\lib.win-amd64-cpython-310\rankfm
copying rankfm\utils.py -> build\lib.win-amd64-cpython-310\rankfm
copying rankfm\__init__.py -> build\lib.win-amd64-cpython-310\rankfm
running build_ext
building 'rankfm._rankfm' extension
creating build\temp.win-amd64-cpython-310
creating build\temp.win-amd64-cpython-310\Release
creating build\temp.win-amd64-cpython-310\Release\rankfm
creating build\temp.win-amd64-cpython-310\Release\rankfm\mt19937ar
"C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\bin\HostX86\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -IC:\my_py_environments\py310_env_flaml_aug1_2023\include -IC:\Users\cde3\AppData\Local\Programs\Python\Python310\include -IC:\Users\cde3\AppData\Local\Programs\Python\Python310\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\ATLMFC\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\VS\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\cppwinrt" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" /Tcrankfm/_rankfm.c /Fobuild\temp.win-amd64-cpython-310\Release\rankfm/_rankfm.obj -O2 -ffast-math -Wno-unused-function -Wno-uninitialized
cl : Command line error D8021 : invalid numeric argument '/Wno-unused-function'
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\bin\HostX86\x64\cl.exe' failed with exit code 2
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> rankfm

note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.

[notice] A new release of pip is available: 23.0.1 -> 23.2.1
[notice] To update, run: python.exe -m pip install --upgrade pip

Convergence issues

I read in the documentation:

"If you run into overflow errors then decrease the feature and/or sample-weight magnitudes and try upping beta, especially if you have a small number of dense user-features and/or item-features."

I did not understand the meaning of decreasing the weight magnitudes. Currently, all weights are set to 1. Are you suggesting setting them to, say, 0.5, the same for all rows? But in that case, there will be no difference.

Default value of beta is set to 0.1. How high do you recommend raising it?

Thanks.

User and item features

RankFM has user and item features, and that is great. However, I have a use case with features that cannot be put into this nice form. Specifically, I have some user-item pairs that occur multiple times in my data set. For example, a user might purchase the same product on different days, or purchase multiple products. It does not appear that this use case can be handled by the current API. On the other hand, the algorithm found in the function compute_ui_utility should be able to handle this case with some modification. Until this morning, I was under the impression that each row of the user-product array had a series of features.

So my question is: can I modify the library to handle the use case where I have N features per row without requiring features at the user and item levels?

I hope I was clear. Thank you for any advice.

Suggestion for changing multiplier in _rankfm.pyx

First of all, thank you for providing good code. But I would like to suggest changing the multiplier in the WARP part
(line 269 in _rankfm.pyx)

from
multiplier = log((I - 1) / sampled) / log(I)

to
multiplier = log((I - items_user[u]) / sampled) / log(I)

From a mathematical viewpoint, the numerator should be the size of the population of j (in this case, the negative items for u). Since the population of j is the complement of items_user[u], I think it's better to change the numerator to I - items_user[u].
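To see how much the two numerators differ in practice, here is a small numeric sketch. It assumes `items_user[u]` is a collection of the user's observed items, so its size is what gets subtracted; the function name `warp_multiplier` and the parameter `n_positive` are hypothetical and not taken from `_rankfm.pyx`.

```python
from math import log


def warp_multiplier(n_items, n_sampled, n_positive=1):
    """log(candidate negatives / samples drawn) / log(total items)."""
    return log((n_items - n_positive) / n_sampled) / log(n_items)


n_items = 1000

# current form: numerator is I - 1
current = warp_multiplier(n_items, n_sampled=1)

# proposed form: numerator is I minus the number of items the user has observed
proposed = warp_multiplier(n_items, n_sampled=1, n_positive=50)
```

With a catalog much larger than a typical user's history the two values stay close; the gap grows as users interact with a larger share of the catalog.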

Unable to install on Windows

I've been trying to install RankFM on Windows but haven't been able to do so.
First, I tried installing Cygwin to get the GNU Compiler Collection on Windows (now I can use gcc on cmd), but when installing I received: "Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools"". Then I installed the latest version of Microsoft Visual C++, and now it is throwing the following:

  Building wheel for rankfm (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [20 lines of output]
      building extensions with pre-generated C source...
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build\lib.win-amd64-3.9
      creating build\lib.win-amd64-3.9\rankfm
      copying rankfm\evaluation.py -> build\lib.win-amd64-3.9\rankfm
      copying rankfm\rankfm.py -> build\lib.win-amd64-3.9\rankfm
      copying rankfm\utils.py -> build\lib.win-amd64-3.9\rankfm
      copying rankfm\__init__.py -> build\lib.win-amd64-3.9\rankfm
      running build_ext
      building 'rankfm._rankfm' extension
      creating build\temp.win-amd64-3.9
      creating build\temp.win-amd64-3.9\Release
      creating build\temp.win-amd64-3.9\Release\rankfm
      creating build\temp.win-amd64-3.9\Release\rankfm\mt19937ar
      "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.31.31103\bin\HostX86\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -Ic:\users\makue\documents\rankfm\venv\include -IC:\Users\makue\AppData\Local\Programs\Python\Python39\include -IC:\Users\makue\AppData\Local\Programs\Python\Python39\Include "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.31.31103\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.19041.0\\shared" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.19041.0\\um" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.19041.0\\winrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.19041.0\\cppwinrt" /Tcrankfm/_rankfm.c /Fobuild\temp.win-amd64-3.9\Release\rankfm/_rankfm.obj -O2 -ffast-math -Wno-unused-function -Wno-uninitialized
      cl : Command line error D8021 : invalid numeric argument '/Wno-unused-function'
      error: command 'C:\\Program Files\\Microsoft Visual Studio\\2022\\Community\\VC\\Tools\\MSVC\\14.31.31103\\bin\\HostX86\\x64\\cl.exe' failed with exit code 2
      [end of output]

Can you please tell me what I can do? Thank you
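The log shows GCC-style options (`-ffast-math`, `-Wno-unused-function`, `-Wno-uninitialized`) being handed to Microsoft's `cl.exe`, which rejects them with error D8021. A common way to avoid this in a build script is to choose the flags per platform. The following is a sketch under assumptions, not the project's actual `setup.py`: it assumes the flags are passed through the `extra_compile_args` argument of `setuptools.Extension`, and the helper name `extra_compile_args` is hypothetical.

```python
import sys


def extra_compile_args(platform=sys.platform):
    """Return optimization flags in the syntax the platform's usual compiler expects."""
    if platform.startswith("win"):
        # MSVC (cl.exe) uses slash-style options and has no -Wno-... flags
        return ["/O2", "/fp:fast"]
    # GCC/Clang flags, matching the ones visible in the failing command line
    return ["-O2", "-ffast-math", "-Wno-unused-function", "-Wno-uninitialized"]


windows_flags = extra_compile_args("win32")
linux_flags = extra_compile_args("linux")
```

In a build script the result would be passed as `Extension(..., extra_compile_args=extra_compile_args())`. A Windows setup that builds with MinGW rather than MSVC would need the GCC branch instead, so a real script may want to inspect the compiler rather than the platform.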

Cannot incorporate item_feature or user_feature in fit()

I tried to use item_feature in the fit() method but I got:
/usr/local/lib/python3.7/dist-packages/rankfm/rankfm.py in _init_features(self, user_features, item_features)
212 self.x_if = np.ascontiguousarray(x_if.sort_index(), dtype=np.float32)
213 else:
--> 214 raise KeyError('the items in [item_features] do not match the items in [interactions]')
215 else:
216 self.x_if = np.zeros([len(self.item_idx), 1], dtype=np.float32)

KeyError: 'the items in [item_features] do not match the items in [interactions]'

When adding user_feature, I got a similar error:
/usr/local/lib/python3.7/dist-packages/rankfm/rankfm.py in _init_features(self, user_features, item_features)
200 self.x_uf = np.ascontiguousarray(x_uf.sort_index(), dtype=np.float32)
201 else:
--> 202 raise KeyError('the users in [user_features] do not match the users in [interactions]')
203 else:
204 self.x_uf = np.zeros([len(self.user_idx), 1], dtype=np.float32)

KeyError: 'the users in [user_features] do not match the users in [interactions]'

I double-checked my data and there are matching catalog_ids and user_ids in both training data and the feature data.

What could be the issue?

Error while fit with 200k user_interaction matrix, item features and user features

I'm running the library on a virtual server with 64 GB of RAM.
My data consists of:
200k distinct interactions between users and items
a 52k x 11 user_feature matrix
a 2770 x 49 item_feature matrix
all NAs are replaced by 0

When I try to run it, I get this error:
AssertionError: user factors [v_u] are not finite - try decreasing feature/sample_weight magnitudes
Sometimes it gives me an item factors error as well.

However, if I run on 170k user interactions without user_features and item_features, it runs smoothly.

What is the meaning of the error?
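The message itself points at feature magnitudes, so one thing to try is rescaling every feature column to a small, bounded range before fitting. The sketch below uses min-max scaling with NumPy; whether it resolves this particular report is an assumption, since the feature values are not shown. The function name `min_max_scale` and the sample matrix are hypothetical.

```python
import numpy as np


def min_max_scale(features):
    """Rescale each column to [0, 1]; constant columns become all zeros."""
    x = np.asarray(features, dtype=np.float32)
    col_min = x.min(axis=0)
    col_range = x.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid dividing by zero for constant columns
    return (x - col_min) / col_range


# hypothetical dense user features: age, yearly income, a constant flag
raw = np.array([
    [25.0, 52000.0, 1.0],
    [40.0, 87000.0, 1.0],
    [31.0, 61000.0, 1.0],
])
scaled = min_max_scale(raw)
```

Only the feature columns should be scaled; the identifier column of a feature frame has to be left untouched so that it still matches the interaction data.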

Is there a way to utilize all CPUs during training?

Thanks for your great library. One thing I noticed is that model training doesn't utilize all available CPUs on the machine, so training is very slow for larger datasets. Is there any parameter to pass to enable multi-CPU training?

Question: User/Item Interaction Features

This looks like a very promising library - congrats!

I am not familiar with the theory yet, but is it possible to include user/interaction features? For example, a typical use case is the amount of time elapsed since a product was last purchased.

Cannot generate the normalized DCG

Hello,
I am having a great experience with your library.
For research purposes I need to compute the NDCG, but you only provide the DCG.
Could you please add it or give us some hints on how to compute it?
Thanks
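NDCG can be computed outside the library from each user's recommendation list and held-out items. The sketch below uses the common binary-relevance definition, where DCG is divided by the DCG of an ideal ranking; it is a standalone illustration and may not follow the same conventions as the library's `discounted_cumulative_gain`. The function names `dcg_at_k` and `ndcg_at_k` and the sample ids are hypothetical.

```python
from math import log2


def dcg_at_k(recommended, relevant, k=10):
    """Binary-relevance DCG: each hit at rank r contributes 1 / log2(r + 1)."""
    relevant = set(relevant)
    top_k = list(recommended)[:k]
    return sum(1.0 / log2(rank + 1)
               for rank, item in enumerate(top_k, start=1)
               if item in relevant)


def ndcg_at_k(recommended, relevant, k=10):
    """DCG divided by the DCG of an ideal ranking with all hits placed first."""
    ideal_hits = min(len(set(relevant)), k)
    idcg = sum(1.0 / log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg_at_k(recommended, relevant, k) / idcg if idcg > 0 else 0.0


# hypothetical user: relevant items 3 and 1 show up at ranks 2 and 4
ndcg = ndcg_at_k([5, 3, 9, 1], {3, 1}, k=4)
```

Averaging `ndcg_at_k` over all validation users, with one row of the `recommend()` output per user as the ranked list, gives a model-level score.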

Question on how to save and load a model

1. Can you please explain how to save and load the best model?
2. Also, is there any way to parallelize (use multiprocessing) the training/prediction part? I have observed that only one core of my machine is being used.
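For the first question, the generic Python route is `pickle`. Whether a fitted RankFM object pickles cleanly is an assumption that this page does not confirm, so the sketch below round-trips a plain stand-in object instead of a real model; the helper names `save_model` and `load_model` are hypothetical.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Serialize any picklable object to disk."""
    with open(path, "wb") as f:
        pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_model(path):
    """Load an object previously written by save_model."""
    with open(path, "rb") as f:
        return pickle.load(f)


# placeholder for a fitted model object; not a RankFM instance
stand_in = {"factors": 20, "weights": [0.12, -0.40, 0.07]}

path = os.path.join(tempfile.mkdtemp(), "model.pkl")
save_model(stand_in, path)
restored = load_model(path)
```

Pickle files should only be loaded from trusted sources, and generally need the same library version installed as when they were written.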

What is the point of demoing your code on data without auxiliary features when the code is advertised as auxiliary-feature specific?

Could you clarify how your code works with its key advertised feature, as described in
https://towardsdatascience.com/factorization-machines-for-item-recommendation-with-implicit-feedback-data-5655a7c749db

To overcome these limitations we need a more general model framework that can extend the latent factor approach to incorporate arbitrary auxiliary features, and specialized loss functions that directly optimize item rank-order using implicit feedback data. Enter Factorization Machines and Learning-to-Rank.

However, you are testing your code on data without auxiliary features, as you wrote:
Unfortunately, there are no user auxiliary features to take advantage of with this data set.

What is the point of demoing your code on data without auxiliary features when the code is advertised as auxiliary-feature specific?

NaNs leading to KeyError while comparing arrays during user_item_index vectors generation


KeyError Traceback (most recent call last)
in
----> 1 model.fit(interactions, user_features, item_features, sample_weight, epochs=50, verbose=True)

~/.virtualenv/turicreate/lib/python3.8/site-packages/rankfm/rankfm.py in fit(self, interactions, user_features, item_features, sample_weight, epochs, verbose)
263
264 self._reset_state()
--> 265 self.fit_partial(interactions, user_features, item_features, sample_weight, epochs, verbose)
266
267

~/.virtualenv/turicreate/lib/python3.8/site-packages/rankfm/rankfm.py in fit_partial(self, interactions, user_features, item_features, sample_weight, epochs, verbose)
287 self._init_features(user_features, item_features)
288 else:
--> 289 self._init_all(interactions, user_features, item_features, sample_weight)
290
291 # determine the number of negative samples to draw depending on the loss function

~/.virtualenv/turicreate/lib/python3.8/site-packages/rankfm/rankfm.py in _init_all(self, interactions, user_features, item_features, sample_weight)
133
134 # map the user/item features to internal index positions
--> 135 self._init_features(user_features, item_features)
136
137 # initialize the model weights after the user/item/feature dimensions have been established

~/.virtualenv/turicreate/lib/python3.8/site-packages/rankfm/rankfm.py in _init_features(self, user_features, item_features)
200 self.x_uf = np.ascontiguousarray(x_uf.sort_index(), dtype=np.float32)
201 else:
--> 202 raise KeyError('the users in [user_features] do not match the users in [interactions]')
203 else:
204 self.x_uf = np.zeros([len(self.user_idx), 1], dtype=np.float32)

KeyError: 'the users in [user_features] do not match the users in [interactions]'

Citation?

Hi,

Thank you so much for this library. AFAIK it is the only FM lib with WARP loss.

I was thinking of using it, and I was wondering whether you have a source (paper) for the actual implementation you followed, or any special citing for it.

Thank you and keep up the great work!

item_features are not actually used in the tutorial, and other questions.

Just a few things I'm not clear on going through the medium article (https://towardsdatascience.com/factorization-machines-for-item-recommendation-with-implicit-feedback-data-5655a7c749db) and reading the notebook here on github.

1.

The Medium article talks about using item/user features as part of a recommender system. The accompanying notebook creates item features based on the aisle number:

item_features
item_features_train
item_features_valid

and never uses them. Unless I missed something? I checked the docs as well. Should the notebook be using them somewhere?

Should the product_id be in the index of item_features_train? Because in the notebook it's just a regular column:

image

Anyway, it looks like item_features_train goes into model.fit()

model.fit(interactions_train, sample_weight=sample_weight_train, epochs=30, verbose=True, item_features=item_features_train)

I've compared the evaluation using item_features_train to the original model in the notebook which doesn't use item_features_train. The scores are slightly lower when I use item_features but not by much. Which makes me think I'm either doing something wrong (should product_id be in the index of item_features_train?), or these features are just not great and don't add anything other than noise.

image

2.

The dataset also has department id as a potential feature. How would we use both aisle and department id in the same model? The item_features argument in model.fit() takes a single dummified dataframe. Do we use fit_partial to update the model with extra dataframes of user features or item features? Let's say I have the age, city, gender, and monthly income of my users. This would be four dataframes. How would I use them all?
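Several attributes can live in the same frame, since a feature frame is simply an identifier column followed by any number of feature columns. The sketch below one-hot encodes two categorical columns at once and mixes numeric and categorical user attributes in a single frame. The data and column values are hypothetical, and it assumes the identifier is expected in the first column, as in the documented layout `[user_id, uf_1, ..., uf_n]`.

```python
import pandas as pd

# hypothetical product catalog with two categorical attributes
products = pd.DataFrame({
    "product_id": [101, 102, 103],
    "aisle_id": [24, 24, 83],
    "department_id": [4, 4, 16],
})

# one frame holding both sets of dummy columns; product_id stays first
item_features = pd.get_dummies(
    products, columns=["aisle_id", "department_id"], dtype=float
)

# hypothetical user attributes: numeric and categorical columns side by side
users = pd.DataFrame({
    "user_id": ["a", "b"],
    "age": [34, 51],
    "city": ["Oslo", "Lima"],
    "monthly_income": [4200.0, 3100.0],
})
user_features = pd.get_dummies(users, columns=["city"], dtype=float)
```

Large numeric columns such as income are worth rescaling before fitting, in line with the README's advice to keep feature magnitudes small.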

3.

What is item_features_valid used for? This is created in the notebook but never used.

4.

There is another section where scores are generated:

scores = model.predict(interactions_valid, cold_start='nan')

print(scores)

array([-1.1237175 ,  0.00314923,  1.5434608 , ...,  2.029984  ,
        1.8916801 ,  2.7111115 ], dtype=float32)

What are these used for? They are not used in the notebook after generating them.

Bug when using user features?

When running fit() with user features, I get the error:

KeyError: 'the users in [user_features] do not match the users in [interactions]'

which has been reported previously. In my case, I did some debugging in the source code and found the following. In the function _init_features, one finds the statement:

            if np.array_equal(sorted(x_uf.index.values), self.user_idx):
                self.x_uf = np.ascontiguousarray(x_uf.sort_index(), dtype=np.float32)
            else:
                raise KeyError('the users in [user_features] do not match the users in [interactions]')

which is the error in question. Looking at the definition of self.user_idx, one finds, in the same file rankfm.py:

        # store unique values of user/item indexes and observed interactions for each user
        self.user_idx = np.arange(len(self.user_id), dtype=np.int32)
        self.item_idx = np.arange(len(self.item_id), dtype=np.int32)

near line 128. Clearly, self.user_idx contains consecutive indexes 0, 1, 2, ... up to the number of user ids.
However, sorted(x_uf.index.values) is the sorted list of user ids. Thus, the two lists cannot be equal. The code that leads me to this conclusion is:

        if user_features is not None:
            x_uf = pd.DataFrame(user_features.copy())
            x_uf = x_uf.set_index(x_uf.columns[0])
            x_uf.index = x_uf.index.map(self.user_to_index)
            if np.array_equal(sorted(x_uf.index.values), self.user_idx):

As far as I understand, the first column of user_features, which is an argument to the function, should be the actual user_id, which can be anything, as long as it does not appear twice in the dataframe. In this case, the conditional (last line) can not be satisfied.
Therefore, I must not understand the data format of user_features. Where is this explained? The documentation states the following:

user_features – dataframe of user metadata features: [user_id, uf_1, … , uf_n]

with no additional information regarding the values of user_id. Any clarification would be most welcome!
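Reading the snippet quoted above, the feature index is passed through `self.user_to_index` before the comparison, so at that point it would hold internal indexes rather than raw ids. Under that reading the check passes only when the feature frame has exactly one row for every user in the interactions, with no unknown ids. The pure-Python sketch below mimics that logic; it assumes `user_to_index` is a plain id-to-index mapping, and the function name `features_match` is hypothetical.

```python
def features_match(interaction_ids, feature_ids):
    """Mimic the quoted check: map ids to indexes, sort, compare with 0..n-1."""
    unique_ids = sorted(set(interaction_ids))
    to_index = {raw: idx for idx, raw in enumerate(unique_ids)}

    # ids absent from the interactions map to None here (pandas would give NaN)
    mapped = [to_index.get(raw) for raw in feature_ids]
    if any(idx is None for idx in mapped):
        return False
    return sorted(mapped) == list(range(len(unique_ids)))


interaction_users = [1001, 1001, 2040, 3007]

exact = features_match(interaction_users, [3007, 1001, 2040])         # one row per user
missing = features_match(interaction_users, [1001, 2040])             # user 3007 absent
duplicated = features_match(interaction_users, [1001, 1001, 2040, 3007])
unknown = features_match(interaction_users, [1001, 2040, 3007, 9999]) # extra user
```

Only the first case passes; a missing user, a repeated user or a user that never appears in the interactions would each trigger the KeyError.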

Capturing the loss function

Hi,

I would like to capture the loss function as a function of epoch into an array. Currently, it is only possible to print it to stdout via the verbose=True argument of fit. Could the code be enhanced to allow the user to specify calling functions? Alternatively, return the loss function from the C code? Thanks.
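Until such a hook exists, one workaround is to train one epoch at a time and record a metric after each call. The tracebacks on this page show a `fit_partial(interactions, user_features, item_features, sample_weight, epochs, verbose)` method; the sketch assumes it continues from the current weights, which is not confirmed here, and it records whatever the `evaluate` callback returns (for example a validation metric) rather than the training log-likelihood. `track_per_epoch` and `StubModel` are hypothetical names, and the stub only stands in for a real model so the example runs on its own.

```python
def track_per_epoch(model, train, evaluate, epochs):
    """Train one epoch per call and collect evaluate(model) after each epoch."""
    history = []
    for _ in range(epochs):
        model.fit_partial(train, epochs=1, verbose=False)
        history.append(evaluate(model))
    return history


class StubModel:
    """Stand-in with a compatible fit_partial call shape; not RankFM itself."""

    def __init__(self):
        self.epochs_seen = 0

    def fit_partial(self, interactions, epochs=1, verbose=False):
        self.epochs_seen += epochs


history = track_per_epoch(
    StubModel(),
    train=[(1, 10), (2, 20)],
    evaluate=lambda m: m.epochs_seen,  # replace with a real metric call
    epochs=3,
)
```

With a real model, `evaluate` could wrap one of the functions from the evaluation module, e.g. `lambda m: hit_rate(m, interactions_valid, k=10)`.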

Could you clarify whether you tried/tested your code with auxiliary features?

In your blog post
https://towardsdatascience.com/factorization-machines-for-item-recommendation-with-implicit-feedback-data-5655a7c749db
you wrote:
Unfortunately, there are no user auxiliary features to take advantage of with this data set.

but auxiliary features are essential to what you have developed.

Have you perhaps since found data with auxiliary features?

Tracking the loss function

I would like to store the time-dependent loss function in an array. It would be nice if there was a hook function that would allow me to do this, or have the call to the trainer return this list. Can anybody help with that? The compilation to C complicates matters. Thanks.
