mobiletelesystems / rectools
RecTools - library to build Recommendation Systems easier and faster than ever before
License: Apache License 2.0
Add some functions to load commonly used recommendation datasets like movielens, lastfm, kion, etc.
Think about: what does scikit-learn do?
Add the ability to recommend for cold users and items to models:
LightFM
DSSM (warm only)
Copy existing Jupyter notebooks to Google Colab and add links to Readme and docs
add_holdout_fold parameter for splitters (default is False)
run_on_holdout_fold parameter in cross_validate (default is False)
Holdout validation is an important part of experiments. It's nice to have it out of the box.
If user_id and item_id columns are CategoricalDtype, np.setdiff1d works very slowly on large volumes (>10 million unique values).
Possible solution is to replace it with:
new_users = set(df_test[Columns.User].unique()) - set(df_train[Columns.User].unique())
And the same for items.
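The proposed replacement can be sketched on toy data (a minimal illustration; a plain `user_id` column stands in for Columns.User):

```python
import pandas as pd

# Toy interactions; "user_id" stands in for Columns.User
df_train = pd.DataFrame({"user_id": [1, 2, 3]})
df_test = pd.DataFrame({"user_id": [2, 3, 4, 5]})

# Python set difference avoids np.setdiff1d, which is slow
# on large categorical columns
new_users = set(df_test["user_id"].unique()) - set(df_train["user_id"].unique())
print(sorted(new_users))  # -> [4, 5]: users present in test but not in train
```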
Several model wrappers (LightFM, Implicit) do not work with float32; they require float64.
In this line the sparse matrix is converted to float32.
Possible solutions:
The third option is the least optimal, since the conversion np.float64 -> np.float32 -> np.float64 will cause "rounding".
I can take this ticket and open a PR.
When code is executed on Colab, kernel dies.
import os
os.environ["OPENBLAS_NUM_THREADS"] = "1"
This is not enough for running the implicit library's iALS on Colab: the user still gets a warning about multithreading. It is necessary to do both:
import os
os.environ["OPENBLAS_NUM_THREADS"] = "1"
import threadpoolctl
threadpoolctl.threadpool_limits(1, "blas")
Environment: Google Colab
Python version: 3.10
RecTools version: 0.4.2
Plotly scatterplot widgets with functionality to select metrics for axes and hue from model parameters
A great way to find Pareto-optimal solutions in case of a metrics trade-off
Support for different scenarios in our validation pipelines
Real world challenge
Realisation of HitRate metric
Classic RecSys metric
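HitRate is not yet in rectools.metrics, so as an illustration only, here is a minimal pandas sketch of what HitRate@k could compute (plain `user_id`/`item_id`/`rank` column names stand in for rectools.Columns):

```python
import pandas as pd

def hit_rate(reco: pd.DataFrame, interactions: pd.DataFrame, k: int) -> float:
    """Share of users with at least one relevant item in their top-k reco."""
    top_k = reco[reco["rank"] <= k]
    hits = top_k.merge(interactions, on=["user_id", "item_id"])
    return hits["user_id"].nunique() / reco["user_id"].nunique()

reco = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "item_id": [3, 4, 5, 6],
    "rank":    [1, 2, 1, 2],
})
interactions = pd.DataFrame({"user_id": [1, 2], "item_id": [3, 9]})
print(hit_rate(reco, interactions, k=2))  # user 1 hit, user 2 missed -> 0.5
```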
Metric to measure intersection in user-item (or item-item) pairs between recommendation lists.
It helps both for candidate-generator selection in pipelines and for popularity bias measurement.
intersection = Intersection(k=10)
# one metric
intersection_value = intersection.calc(reco=recos, ref_recos=ref_recos)
intersection_per_user = intersection.calc_per_user(reco=recos, ref_recos=ref_recos)
# calc_metrics
calc_metrics(
    metrics={"precision": precision, "intersection": intersection},
    reco=recos,
    interactions=df_test,
    ref_recos=ref_recos,  # Union[pd.DataFrame, Dict[Hashable, pd.DataFrame]]
)
We can keep ref_recos a simple pd.DataFrame to calculate intersections with one algorithm.
For multiple intersection calculations we can pass multiple models' recommendations in a dict:
ref_recos = {"one": ref_recos_one, "two": ref_recos_two}
The result dict from calc_metrics will have intersection_one and intersection_two keys if ref_recos is a dict (merging keys from metrics and ref_recos).
# cross_validate
cv_results = cross_validate(
    dataset=dataset,
    splitter=splitter,
    models=models,
    metrics=metrics,
    k=10,
    filter_viewed=True,
    ref_models=["one", "two"],  # here we just select keys from `models` argument
    validate_ref_models=False,  # optionally exclude ref_models from other metrics calculation
)
A metric wrapper that creates debiased validation in case of strong popularity bias in test data. One way to do this is to fight the power-law popularity distribution in test interactions on each fold by down-sampling the fold's popular items.
It serves as a correct objective for hyper-parameter tuning and model selection.
An algorithm to detect and down-sample excessively popular items is needed. More algorithms and modifications can be proposed here. For now we can use the IQR (interquartile range) logic that is also used for boxplots.
For all exceeding items in the test fold we need to randomly keep only the maximum allowed subset of users. We use down-sampling for this.
The wrapper changes test interactions, but afterwards any metrics can be calculated as usual.
from rectools.metrics import DebiasWrapper, Precision
debiased_precision = DebiasWrapper(Precision(k=10), iqr_coef=1.5, random_state=32)
Other possible namings are: PopDownSamplingWrapper, DownSamplingWrapper, UnbiasedWrapper
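The IQR down-sampling step itself can be sketched as follows. This is a rough illustration with placeholder names (`downsample_popular`, a plain `item_id` column), not a proposed final API:

```python
import numpy as np
import pandas as pd

def downsample_popular(test: pd.DataFrame, iqr_coef: float = 1.5,
                       random_state: int = 32) -> pd.DataFrame:
    """Cap interactions per item at Q3 + iqr_coef * IQR of per-item counts,
    randomly sampling which interactions to keep for over-popular items."""
    counts = test.groupby("item_id").size()
    q1, q3 = np.percentile(counts, [25, 75])
    max_users = int(q3 + iqr_coef * (q3 - q1))
    rng = np.random.default_rng(random_state)
    parts = []
    for _, group in test.groupby("item_id"):
        if len(group) > max_users:
            keep = rng.choice(len(group), size=max_users, replace=False)
            group = group.iloc[sorted(keep)]
        parts.append(group)
    return pd.concat(parts, ignore_index=True)
```

On a fold where one item dominates the long tail, only that item gets trimmed; items within the IQR bound pass through untouched.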
A jupyter notebook which shows how to create and use a custom recommender model that inherits from ModelBase
This allows using any models in our pipelines
Check which models have random seed fixation and write tests for all of them:
Avoid problems with seed fixation
Method get_params for all RecTools models: it outputs a dict of all hyper-params available for tuning together with the values of the current instance. Wrapped models' params are also added.
This can be used in validation pipelines for easy-to-use integration with experiments trackers and metrics visualisation that is based on hyper-params.
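A minimal sketch of what such a get_params could return, assuming a hypothetical base-class implementation (the class names, attribute walk, and dotted flattening scheme are all illustrative, not the RecTools API):

```python
from typing import Any, Dict

class ModelBase:
    """Hypothetical base-class sketch; real RecTools models differ."""

    def get_params(self) -> Dict[str, Any]:
        """Collect public attributes as hyper-params, flattening wrapped models."""
        params: Dict[str, Any] = {}
        for name, value in vars(self).items():
            if name.startswith("_"):
                continue  # skip private / fitted state
            if hasattr(value, "get_params"):  # a wrapped inner model
                for k, v in value.get_params().items():
                    params[f"{name}.{k}"] = v
            else:
                params[name] = value
        return params

class PopularModel(ModelBase):
    def __init__(self, period_days: int = 30) -> None:
        self.period_days = period_days

print(PopularModel(period_days=7).get_params())  # {'period_days': 7}
```

A flat dict like this plugs directly into experiment trackers that expect string keys.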
Add tool for visual analysis of recommendations in Jupyter Notebook
rectools.metrics.calc_metrics does not work: when trying to run it I got an empty result.
Code that I tried to run:
import pandas as pd
from rectools import Columns
from rectools.metrics import Accuracy, NDCG, calc_metrics
reco = pd.DataFrame(
    {
        Columns.User: [1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4],
        Columns.Item: [7, 8, 1, 2, 1, 2, 3, 4, 1, 2, 3],
        Columns.Rank: [1, 2, 1, 2, 1, 2, 3, 4, 1, 2, 3],
    }
)
interactions = pd.DataFrame(
    {
        Columns.User: [1, 1, 2, 3, 3, 3, 4, 4, 4],
        Columns.Item: [1, 2, 1, 1, 3, 4, 1, 2, 3],
        Columns.Datetime: [1, 1, 1, 1, 1, 2, 2, 2, 2],
    }
)
split_dt = 2
df_train = interactions.loc[interactions[Columns.Datetime] < split_dt]
df_test = interactions.loc[interactions[Columns.Datetime] >= split_dt]
metrics = {
    'ndcg@1': NDCG(k=1),
    'accuracy@1': Accuracy(k=1)
}
calc_metrics(
    metrics,
    reco=reco,
    interactions=df_test,
    prev_interactions=df_train,
    catalog=df_train[Columns.Item].unique()
)
Output:
{}
Jupyter notebook with a tutorial
It's helpful and great for documentation
Hot user (item) - present in interactions
Warm - has features, but is not present in interactions
Cold - totally new
Now Dataset cannot include warm users (items). And models cannot return reco for users (items) that are not in the dataset. Also models cannot recommend items that are not in the dataset.
The goal is:
Dataset
users parameter in cross_validate will accept a list of external user ids to run all experiments only on these users.
This will help to solve a common task: calculating metrics only on a specific subset of users.
Create a jupyter notebook with an explanation of our default model architecture and ways to use our wrapper with different architectures. Special explanation of the dataset_type parameter.
DSSMModel is not clear in usage.
Add a function that accepts dataset, model and splitter and executes cross-validation process
Add a popularity bias metric, in simple and normalized variants.
I can't install rectools on Windows for python=3.9 due to problems with installing nmslib (one of the dependencies). Issue for tracking - nmslib/nmslib#464
Support for saving and loading models in appropriate formats
It's just super helpful in many cases
Extend models' docstrings in the following way:
Model scores are dot products of the learnt user embedding and recommended item embeddings for user-to-item recommendations. For item-to-item recommendations, scores are cosine similarities between the target item embedding and recommended item embeddings.
Helpful for new users.
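The two score definitions from the docstring text can be illustrated with a small numpy example (random embeddings, purely for demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
user_emb = rng.normal(size=(5, 8))   # 5 users, embedding dim 8
item_emb = rng.normal(size=(10, 8))  # 10 items

# u2i: scores are plain dot products of user and item embeddings
u2i_scores = user_emb @ item_emb.T  # shape (5, 10)

# i2i: scores are cosine similarities with the target item embedding
target = item_emb[0]
norms = np.linalg.norm(item_emb, axis=1) * np.linalg.norm(target)
i2i_scores = item_emb @ target / norms  # shape (10,); self-similarity is 1.0
```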
I2I validation on meta-vectors distances between target item and recommended items
One of the approaches to item-to-item validation
If Columns.Model is missing in reco, an error is thrown. But we can handle this case by adding a default model name.
More convenient for the user.
With pandas==0.25.3
import pandas as pd
from rectools.dataset import IdMap
user_id_map = IdMap.from_values(pd.array([1, 2], dtype=pd.Int32Dtype()))
user_id_map.size
# AttributeError: 'IntegerArray' object has no attribute 'size'
The issue seems to be affecting not only pandas==0.25.3, but also pandas==1.0.x (didn't check it thoroughly)
There is no such problem with version pandas==1.1.0 and higher
Accessing size of IdMap with IntegerArray type doesn't throw an AttributeError.
Given test interactions, we can exclude all positives that would have been recommended by the reference model (e.g. a popular model). After that we can calculate Recall as usual. The resulting metric is interpreted as the gain to overall Recall achieved by the new algorithm in comparison with the reference algorithm.
Useful for both candidate-generators selection and simple validation protocol in case of strong popularity bias in data (e.g. calculating RecallGain from a Popular model).
recall_gain = RecallGain(k=10, k_ref=20)
# one metric
value = recall_gain.calc(reco=recos, ref_recos=ref_recos)
per_user = recall_gain.calc_per_user(reco=recos, ref_recos=ref_recos)
# calc_metrics
calc_metrics(
    metrics={"precision": precision, "intersection": intersection, "recall_gain": recall_gain},
    reco=recos,
    interactions=df_test,
    ref_recos=ref_recos,  # Union[pd.DataFrame, Dict[Hashable, pd.DataFrame]]
)
# here we can pass multiple models with:
ref_recos = {"one": ref_recos_one, "two": ref_recos_two}
# result dict will have `intersection_one`, `intersection_two`, `recall_gain_one`, `recall_gain_two` as keys
# cross_validate
cv_results = cross_validate(
    dataset=dataset,
    splitter=splitter,
    models=models,
    metrics=metrics,
    k=10,
    filter_viewed=True,
    ref_models=["one", "two"],  # just selecting keys from `models` argument
    validate_ref_models=False,  # optionally exclude ref_models from other metrics
)
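The metric logic itself (drop test positives covered by the reference model's top-k_ref, then compute plain Recall@k on what remains) can be sketched like this. Column names `user_id`/`item_id`/`rank` stand in for rectools.Columns, and the function is illustrative, not the proposed implementation:

```python
import pandas as pd

def recall_gain(reco, ref_recos, interactions, k, k_ref):
    """Recall@k over test positives NOT covered by the reference model's top-k_ref.
    Users whose positives are all covered by the reference drop out entirely."""
    ref_top = ref_recos.loc[ref_recos["rank"] <= k_ref, ["user_id", "item_id"]]
    # drop test positives already covered by the reference model
    remaining = interactions.merge(
        ref_top, on=["user_id", "item_id"], how="left", indicator=True
    ).query("_merge == 'left_only'")[["user_id", "item_id"]]
    top = reco.loc[reco["rank"] <= k, ["user_id", "item_id"]]
    hits = remaining.merge(top, on=["user_id", "item_id"])
    per_user = hits.groupby("user_id").size() / remaining.groupby("user_id").size()
    return float(per_user.fillna(0).mean())

reco = pd.DataFrame({"user_id": [1, 1], "item_id": [2, 5], "rank": [1, 2]})
ref_recos = pd.DataFrame({"user_id": [1], "item_id": [1], "rank": [1]})
interactions = pd.DataFrame({"user_id": [1, 1, 1], "item_id": [1, 2, 3]})
print(recall_gain(reco, ref_recos, interactions, k=2, k_ref=1))
# item 1 is excluded as a reference hit; 1 of the 2 remaining positives is hit -> 0.5
```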
The method does not work even with the example from the documentation :(
!pip install RecTools
from rectools.metrics.ranking import MAP
from rectools import Columns
import pandas as pd
Columns.Item = 'movie_id'
Columns.User = 'user_id'
Columns.Rank = 'rank'
reco = pd.DataFrame(
{
Columns.User: [1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4],
Columns.Item: [7, 8, 1, 2, 1, 2, 3, 4, 1, 2, 3],
Columns.Rank: [1, 2, 1, 2, 1, 2, 3, 4, 1, 2, 3],
}
)
interactions = pd.DataFrame(
{
Columns.User: [1, 1, 2, 3, 3, 3, 4, 4, 4],
Columns.Item: [1, 2, 1, 1, 3, 4, 1, 2, 3],
}
)
MAP(k=3).calc_per_user(reco, interactions)
File /opt/conda/lib/python3.10/site-packages/rectools/metrics/ranking.py:247, in MAP.calc_per_user(self, reco, interactions)
245 self._check(reco, interactions=interactions)
246 merged_reco = merge_reco(reco, interactions)
--> 247 fitted = self.fit(merged_reco, k_max=self.k)
248 return self.calc_per_user_from_fitted(fitted)
File /opt/conda/lib/python3.10/site-packages/rectools/metrics/ranking.py:192, in MAP.fit(cls, merged, k_max)
189 prec_at_k_csr = sparse.csr_matrix(np.array([]).reshape(0, 0))
190 return MAPFitted(prec_at_k_csr, users, np.array([]))
--> 192 n_relevant_items = merged.groupby(Columns.User, sort=False)[Columns.Item].agg("size")[users].values
194 user_to_idx_map = pd.Series(np.arange(users.size), index=users)
195 df_prepared = merged.query(f"{Columns.Rank} <= @k_max")
KeyError: 'Column not found: movie_id'
Realisation of SLIM: Sparse Linear Methods for Top-N Recommender Systems
Strong collaborative filtering baseline
We need to fix k+1 to k in the sum. We also need to add that the formula is correct for one user; for multiple users we need to average the results.
The doc is not 100% correct.
Right now DSSMModel has one parameter without a default value: dataset_type: TorchDataset[tp.Any]. This is very confusing since DSSMModel also has a default model, but the user can't use it out of the box.
DSSMModel doesn't follow the simple interface of other RecTools models.
Add an r_precision: bool = False parameter to the Precision metric.
When r_precision is set to True, the number of true positives in the user's top-K recommendations is divided by min(K, test items count for the user) instead of simply dividing by K.
This variant has a better interpretation. https://www.shaped.ai/blog/evaluating-recommendation-systems-part-1
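A sketch of the proposed behaviour on toy data (the `precision_at_k` helper and the plain `user_id`/`item_id`/`rank` column names are illustrative, not the rectools API):

```python
import pandas as pd

def precision_at_k(reco, interactions, k, r_precision=False):
    """Per-user precision@k; with r_precision=True the denominator becomes
    min(k, number of the user's test items) instead of plain k."""
    top = reco[reco["rank"] <= k]
    tp = (top.merge(interactions, on=["user_id", "item_id"])
             .groupby("user_id").size())
    n_test = interactions.groupby("user_id").size()
    denom = n_test.clip(upper=k) if r_precision else k
    return tp.reindex(n_test.index, fill_value=0) / denom

# User 1 has only one test item, so precision@3 can never exceed 1/3 ...
reco = pd.DataFrame({"user_id": [1, 1, 1], "item_id": [1, 2, 3], "rank": [1, 2, 3]})
interactions = pd.DataFrame({"user_id": [1], "item_id": [1]})
print(precision_at_k(reco, interactions, k=3).loc[1])                    # -> 1/3
# ... while R-Precision divides by min(3, 1) = 1, giving a perfect score
print(precision_at_k(reco, interactions, k=3, r_precision=True).loc[1])  # -> 1.0
```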
Realisation of Embarrassingly Shallow Autoencoders for Sparse Data
Strong baseline collaborative filtering model for interactions-only cases
Allow use of pandas >= 2.0 and torch >= 2.0
We need to measure how many items are actually recommended for users at top-k positions, out of all possible when Completeness = 100% (every user has all K items recommended).
Not all models recommend the full lists that were requested. It is necessary to easily find cases with Completeness less than 100%.
Other names might be: Filling, Delivery, Fulfillment, Sufficiency or something else. But not Coverage: Coverage is mostly used for other cases.
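Whatever name wins, the quantity itself is simple. A sketch assuming the standard reco frame, with plain `user_id` and `rank` columns standing in for rectools.Columns:

```python
import pandas as pd

def completeness(reco: pd.DataFrame, k: int) -> float:
    """Share of actually delivered top-k positions out of n_users * k."""
    delivered = (reco["rank"] <= k).sum()
    n_users = reco["user_id"].nunique()
    return delivered / (n_users * k)

reco = pd.DataFrame({
    "user_id": [1, 1, 1, 2],          # user 2 got only one item instead of three
    "item_id": [10, 11, 12, 10],
    "rank":    [1, 2, 3, 1],
})
print(completeness(reco, k=3))  # 4 delivered positions of 6 possible
```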
Good afternoon!
In version 0.3.0 installed via pip it is impossible to import KFoldSplitter.
(test) grigoriy@T430:~/Repos/Book_Crossing_RecSys$ python3
Python 3.8.16 (default, Mar 2 2023, 03:21:46)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from rectools.model_selection import KFoldSplitter
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: cannot import name 'KFoldSplitter' from 'rectools.model_selection' (/home/grigoriy/anaconda3/envs/test/lib/python3.8/site-packages/rectools/model_selection/__init__.py)
>>>
If you check the contents of the model_selection folder in the environment, some files are missing:
ls ~/anaconda3/envs/test/lib/python3.8/site-packages/rectools/model_selection
__init__.py __pycache__ time_split.py
To reproduce the problem:
conda create -n test python=3.8
conda activate test
pip3 install rectools
python3
from rectools.model_selection import KFoldSplitter
Python version - 3.8.16
Other dependencies:
Package Version
------------------ ---------
attrs 21.4.0
certifi 2022.12.7
charset-normalizer 3.1.0
idna 3.4
implicit 0.4.4
joblib 1.2.0
lightfm 1.17
Markdown 3.2.2
nmslib 2.1.1
numpy 1.24.3
pandas 1.5.3
pip 23.0.1
psutil 5.9.5
pybind11 2.6.1
python-dateutil 2.8.2
pytz 2023.3
rectools 0.3.0
requests 2.29.0
scikit-learn 1.2.2
scipy 1.10.1
setuptools 66.0.0
six 1.16.0
threadpoolctl 3.1.0
tqdm 4.65.0
typeguard 2.13.3
urllib3 1.26.15
wheel 0.38.4
pip install rectools and pip install rectools[extension-name] installed fine,
but pip install rectools[all] produced the errors below.
SDK and Visual Studio are installed!
opt = self.warn_dash_deprecation(opt, section)
running bdist_wheel
running build
running build_ext
Extra compilation arguments: ['/EHsc', '/openmp', '/O2', '/DVERSION_INFO=\\"2.1.1\\"']
building 'nmslib' extension
creating build
creating build\temp.win-amd64-cpython-310
creating build\temp.win-amd64-cpython-310\Release
creating build\temp.win-amd64-cpython-310\Release\similarity_search
creating build\temp.win-amd64-cpython-310\Release\similarity_search\src
creating build\temp.win-amd64-cpython-310\Release\similarity_search\src\method
creating build\temp.win-amd64-cpython-310\Release\similarity_search\src\space
creating build\temp.win-amd64-cpython-310\Release\tensorflow
"C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.37.32822\bin\HostX86\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -I.\similarity_search\include -Itensorflow -Ic:\temp\pip-install-b171phqx\nmslib_dde37a94bd0b4873a19978ddf40d8a69\.eggs\pybind11-2.6.1-py3.10.egg\pybind11\include -Ic:\temp\pip-install-b171phqx\nmslib_dde37a94bd0b4873a19978ddf40d8a69\.eggs\pybind11-2.6.1-py3.10.egg\pybind11\include -Ic:\temp\pip-install-b171phqx\nmslib_dde37a94bd0b4873a19978ddf40d8a69\.eggs\pybind11-2.6.1-py3.10.egg\pybind11\include -Ic:\temp\pip-install-b171phqx\nmslib_dde37a94bd0b4873a19978ddf40d8a69\.eggs\pybind11-2.6.1-py3.10.egg\pybind11\include -IC:\Users\TDL\mambaforge\envs\karp\lib\site-packages\numpy\core\include -IC:\Users\TDL\mambaforge\envs\karp\include -IC:\Users\TDL\mambaforge\envs\karp\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.37.32822\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\VS\include" /EHsc /Tp.\similarity_search\src\distcomp_bregman.cc /Fobuild\temp.win-amd64-cpython-310\Release\.\similarity_search\src\distcomp_bregman.obj /EHsc /openmp /O2 /DVERSION_INFO=\\\"2.1.1\\\"
distcomp_bregman.cc
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.37.32822\include\yvals.h(21): fatal error C1083: Cannot open include file: crtdbg.h: No such file or directory
error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2022\\BuildTools\\VC\\Tools\\MSVC\\14.37.32822\\bin\\HostX86\\x64\\cl.exe' failed with exit code 2
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for nmslib
Running setup.py clean for nmslib
Failed to build nmslib
ERROR: Could not build wheels for nmslib, which is required to install pyproject.toml-based projects
Add the ability to recommend for cold users and items to models:
PopularModel
PopularInCategoryModel
RandomModel