mobiletelesystems / rectools
RecTools - library to build Recommendation Systems easier and faster than ever before
License: Apache License 2.0
Add some functions to load commonly used recommendation datasets like movielens, lastfm, kion, etc.
Think about: what does scikit-learn do?
Add the ability to recommend for cold users and items to models:
LightFM
DSSM (warm only)
Copy existing Jupyter notebooks to Google Colab and add links to Readme and docs
add_holdout_fold parameter for splitters (default is False)
run_on_holdout_fold parameter in cross_validate (default is False)
Holdout validation is an important part of experiments. It's nice to have it out of the box.
If user_id and item_id columns are CategoricalDtype, np.setdiff1d works very slowly on large volumes (>10 million unique values).
Possible solution is to replace it with:
new_users = set(df_test[Columns.User].unique()) - set(df_train[Columns.User].unique())
And the same for items.
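The proposed replacement can be sketched on toy data (a minimal illustration; a plain `user_id` column stands in for Columns.User):

```python
import pandas as pd

# Toy interactions; "user_id" stands in for Columns.User
df_train = pd.DataFrame({"user_id": [1, 2, 3]})
df_test = pd.DataFrame({"user_id": [2, 3, 4, 5]})

# Python set difference avoids np.setdiff1d, which is slow
# on large categorical columns
new_users = set(df_test["user_id"].unique()) - set(df_train["user_id"].unique())
print(sorted(new_users))  # -> [4, 5]: users present in test but not in train
```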
Several model wrappers (LightFM, Implicit) do not work with float32; they require float64.
In this line the sparse matrix is converted to float32.
Possible solutions:
The third option is the least optimal, since the conversion np.float64 -> np.float32 -> np.float64 will cause "rounding".
I can take this ticket and open a PR.
When code is executed on Colab, kernel dies.
import os
os.environ["OPENBLAS_NUM_THREADS"] = "1"
This is not enough for running the implicit library's iALS on Colab: the user still gets a warning about multithreading. It is necessary to do both:
import os
os.environ["OPENBLAS_NUM_THREADS"] = "1"
import threadpoolctl
threadpoolctl.threadpool_limits(1, "blas")
Environment: Google Colab
Python version: 3.10
RecTools version: 0.4.2
Plotly scatterplot widgets with functionality to select metrics for axes and hue from model parameters
A great way to find Pareto-optimal solutions in case of a metrics trade-off
Support for different scenarios in our validation pipelines
Real world challenge
Realisation of HitRate metric
Classic RecSys metric
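HitRate is not yet in rectools.metrics, so as an illustration only, here is a minimal pandas sketch of what HitRate@k could compute (plain `user_id`/`item_id`/`rank` column names stand in for rectools.Columns):

```python
import pandas as pd

def hit_rate(reco: pd.DataFrame, interactions: pd.DataFrame, k: int) -> float:
    """Share of users with at least one relevant item in their top-k reco."""
    top_k = reco[reco["rank"] <= k]
    hits = top_k.merge(interactions, on=["user_id", "item_id"])
    return hits["user_id"].nunique() / reco["user_id"].nunique()

reco = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "item_id": [3, 4, 5, 6],
    "rank":    [1, 2, 1, 2],
})
interactions = pd.DataFrame({"user_id": [1, 2], "item_id": [3, 9]})
print(hit_rate(reco, interactions, k=2))  # user 1 hit, user 2 missed -> 0.5
```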
Metric to measure intersection in user-item (or item-item) pairs between recommendation lists.
It helps both for candidate-generator selection in pipelines and for popularity bias measurement.
intersection = Intersection(k=10)
# one metric
intersection_value = intersection.calc(reco=recos, ref_recos=ref_recos)
intersection_per_user = intersection.calc_per_user(reco=recos, ref_recos=ref_recos)
# calc_metrics
calc_metrics(
    metrics={"precision": precision, "intersection": intersection},
    reco=recos,
    interactions=df_test,
    ref_recos=ref_recos,  # Union[pd.DataFrame, Dict[Hashable, pd.DataFrame]]
)
We can keep ref_recos a simple pd.DataFrame to calculate intersections with one algorithm.
For multiple intersection calculations we can pass multiple models' recommendations in a dict:
ref_recos = {"one": ref_recos_one, "two": ref_recos_two}
The result dict from calc_metrics will have intersection_one and intersection_two keys if ref_recos is a dict (merging keys from metrics and ref_recos).
# cross_validate
cv_results = cross_validate(
    dataset=dataset,
    splitter=splitter,
    models=models,
    metrics=metrics,
    k=10,
    filter_viewed=True,
    ref_models=["one", "two"],  # here we just select keys from `models` argument
    validate_ref_models=False,  # optionally exclude ref_models from other metrics calculation
)
A metric wrapper that creates debiased validation in case of strong popularity bias in test data. One way to do this is to fight the power-law popularity distribution in test interactions on each fold by down-sampling the fold's popular items.
It serves as a correct objective for hyper-parameter tuning and model selection.
An algorithm to detect and down-sample excessively popular items is needed. More algorithms and modifications can be proposed here. For now we can use the IQR (interquartile range) logic that is also used for boxplots.
For all exceeding items in the test fold we need to randomly keep only the maximum allowed subset of users. We use down-sampling for this.
The wrapper changes test interactions, but afterwards any metrics can be calculated as usual.
from rectools.metrics import DebiasWrapper, Precision
debiased_precision = DebiasWrapper(Precision(k=10), iqr_coef=1.5, random_state=32)
Other possible namings are: PopDownSamplingWrapper, DownSamplingWrapper, UnbiasedWrapper
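The IQR down-sampling step itself can be sketched as follows. This is a rough illustration with placeholder names (`downsample_popular`, a plain `item_id` column), not a proposed final API:

```python
import numpy as np
import pandas as pd

def downsample_popular(test: pd.DataFrame, iqr_coef: float = 1.5,
                       random_state: int = 32) -> pd.DataFrame:
    """Cap interactions per item at Q3 + iqr_coef * IQR of per-item counts,
    randomly sampling which interactions to keep for over-popular items."""
    counts = test.groupby("item_id").size()
    q1, q3 = np.percentile(counts, [25, 75])
    max_users = int(q3 + iqr_coef * (q3 - q1))
    rng = np.random.default_rng(random_state)
    parts = []
    for _, group in test.groupby("item_id"):
        if len(group) > max_users:
            keep = rng.choice(len(group), size=max_users, replace=False)
            group = group.iloc[sorted(keep)]
        parts.append(group)
    return pd.concat(parts, ignore_index=True)
```

On a fold where one item dominates the long tail, only that item gets trimmed; items within the IQR bound pass through untouched.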
A jupyter notebook which shows how to create and use a custom recommender model that inherits from ModelBase
This allows using any models in our pipelines
Check which models have random seed fixation and write tests for all of them:
Avoid problems with seed fixation
Method get_params for all RecTools models: it outputs a dict of all hyper-params available for tuning together with the values of the current instance. Wrapped models' params are also added.
This can be used in validation pipelines for easy-to-use integration with experiments trackers and metrics visualisation that is based on hyper-params.
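A minimal sketch of what such a get_params could return, assuming a hypothetical base-class implementation (the class names, attribute walk, and dotted flattening scheme are all illustrative, not the RecTools API):

```python
from typing import Any, Dict

class ModelBase:
    """Hypothetical base-class sketch; real RecTools models differ."""

    def get_params(self) -> Dict[str, Any]:
        """Collect public attributes as hyper-params, flattening wrapped models."""
        params: Dict[str, Any] = {}
        for name, value in vars(self).items():
            if name.startswith("_"):
                continue  # skip private / fitted state
            if hasattr(value, "get_params"):  # a wrapped inner model
                for k, v in value.get_params().items():
                    params[f"{name}.{k}"] = v
            else:
                params[name] = value
        return params

class PopularModel(ModelBase):
    def __init__(self, period_days: int = 30) -> None:
        self.period_days = period_days

print(PopularModel(period_days=7).get_params())  # {'period_days': 7}
```

A flat dict like this plugs directly into experiment trackers that expect string keys.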
Add tool for visual analysis of recommendations in Jupyter Notebook
rectools.metrics.calc_metrics does not work: when trying to run it I got an empty result.
Code that I tried to run:
import pandas as pd
from rectools import Columns
from rectools.metrics import Accuracy, NDCG, calc_metrics
reco = pd.DataFrame(
    {
        Columns.User: [1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4],
        Columns.Item: [7, 8, 1, 2, 1, 2, 3, 4, 1, 2, 3],
        Columns.Rank: [1, 2, 1, 2, 1, 2, 3, 4, 1, 2, 3],
    }
)
interactions = pd.DataFrame(
    {
        Columns.User: [1, 1, 2, 3, 3, 3, 4, 4, 4],
        Columns.Item: [1, 2, 1, 1, 3, 4, 1, 2, 3],
        Columns.Datetime: [1, 1, 1, 1, 1, 2, 2, 2, 2],
    }
)
split_dt = 2
df_train = interactions.loc[interactions[Columns.Datetime] < split_dt]
df_test = interactions.loc[interactions[Columns.Datetime] >= split_dt]
metrics = {
    'ndcg@1': NDCG(k=1),
    'accuracy@1': Accuracy(k=1)
}
calc_metrics(
    metrics,
    reco=reco,
    interactions=df_test,
    prev_interactions=df_train,
    catalog=df_train[Columns.Item].unique()
)
Output:
{}
Jupyter notebook with a tutorial
It's helpful and great for documentation
Hot user (item) - present in interactions
Warm - has features, but is not present in interactions
Cold - totally new
Now Dataset cannot include warm users (items). And models cannot return reco for users (items) that are not in the dataset. Also models cannot recommend items that are not in the dataset.
The goal is:
Dataset
users parameter in cross_validate will accept a list of external user ids to run all experiments only on these users.
This will help to solve a common task: calculating metrics only on a specific subset of users.
Create a jupyter notebook with an explanation of our default model architecture and ways to use our wrapper with different architectures. Special explanation of the dataset_type parameter.
DSSMModel is not clear in usage.
Add a function that accepts dataset, model and splitter and executes cross-validation process
Add a popularity bias metric, in simple and normalized variants.
I can't install rectools on Windows for python=3.9 due to problems with installing nmslib (one of the dependencies). Issue for tracking - nmslib/nmslib#464
Support for saving and loading models in appropriate formats
It's just super helpful in many cases
Extend models' docstrings in the following way:
Model scores are dot products of the learnt user embedding and recommended item embeddings for user-to-item recommendations. For item-to-item recommendations, scores are cosine similarities between the target item embedding and recommended item embeddings.
Helpful for new users.
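The two score definitions from the docstring text can be illustrated with a small numpy example (random embeddings, purely for demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
user_emb = rng.normal(size=(5, 8))   # 5 users, embedding dim 8
item_emb = rng.normal(size=(10, 8))  # 10 items

# u2i: scores are plain dot products of user and item embeddings
u2i_scores = user_emb @ item_emb.T  # shape (5, 10)

# i2i: scores are cosine similarities with the target item embedding
target = item_emb[0]
norms = np.linalg.norm(item_emb, axis=1) * np.linalg.norm(target)
i2i_scores = item_emb @ target / norms  # shape (10,); self-similarity is 1.0
```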
I2I validation on meta-vectors distances between target item and recommended items
One of the approaches to item-to-item validation
If Columns.Model is missing in reco, an error is thrown. But we can handle this case by adding a default model name.
More convenient for the user.
With pandas==0.25.3
import pandas as pd
from rectools.dataset import IdMap
user_id_map = IdMap.from_values(pd.array([1, 2], dtype=pd.Int32Dtype()))
user_id_map.size
# AttributeError: 'IntegerArray' object has no attribute 'size'
The issue seems to be affecting not only pandas==0.25.3, but also pandas==1.0.x (didn't check it thoroughly)
There is no such problem with version pandas==1.1.0 and higher
Accessing size of IdMap with IntegerArray type doesn't throw an AttributeError.
Given test interactions, we can exclude all positives that would have been recommended by the reference model (e.g. a popular model). After that we can calculate Recall as usual. The resulting metric is interpreted as the gain to overall Recall achieved by the new algorithm in comparison with the reference algorithm.
Useful for both candidate-generators selection and simple validation protocol in case of strong popularity bias in data (e.g. calculating RecallGain from a Popular model).
recall_gain = RecallGain(k=10, k_ref=20)
# one metric
value = recall_gain.calc(reco=recos, ref_recos=ref_recos)
per_user = recall_gain.calc_per_user(reco=recos, ref_recos=ref_recos)
# calc_metrics
calc_metrics(
    metrics={"precision": precision, "intersection": intersection, "recall_gain": recall_gain},
    reco=recos,
    interactions=df_test,
    ref_recos=ref_recos,  # Union[pd.DataFrame, Dict[Hashable, pd.DataFrame]]
)
# here we can pass multiple models with:
ref_recos = {"one": ref_recos_one, "two": ref_recos_two}
# result dict will have `intersection_one`, `intersection_two`, `recall_gain_one`, `recall_gain_two` as keys
# cross_validate
cv_results = cross_validate(
    dataset=dataset,
    splitter=splitter,
    models=models,
    metrics=metrics,
    k=10,
    filter_viewed=True,
    ref_models=["one", "two"],  # just selecting keys from `models` argument
    validate_ref_models=False,  # optionally exclude ref_models from other metrics
)
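The metric logic itself (drop test positives covered by the reference model's top-k_ref, then compute plain Recall@k on what remains) can be sketched like this. Column names `user_id`/`item_id`/`rank` stand in for rectools.Columns, and the function is illustrative, not the proposed implementation:

```python
import pandas as pd

def recall_gain(reco, ref_recos, interactions, k, k_ref):
    """Recall@k over test positives NOT covered by the reference model's top-k_ref.
    Users whose positives are all covered by the reference drop out entirely."""
    ref_top = ref_recos.loc[ref_recos["rank"] <= k_ref, ["user_id", "item_id"]]
    # drop test positives already covered by the reference model
    remaining = interactions.merge(
        ref_top, on=["user_id", "item_id"], how="left", indicator=True
    ).query("_merge == 'left_only'")[["user_id", "item_id"]]
    top = reco.loc[reco["rank"] <= k, ["user_id", "item_id"]]
    hits = remaining.merge(top, on=["user_id", "item_id"])
    per_user = hits.groupby("user_id").size() / remaining.groupby("user_id").size()
    return float(per_user.fillna(0).mean())

reco = pd.DataFrame({"user_id": [1, 1], "item_id": [2, 5], "rank": [1, 2]})
ref_recos = pd.DataFrame({"user_id": [1], "item_id": [1], "rank": [1]})
interactions = pd.DataFrame({"user_id": [1, 1, 1], "item_id": [1, 2, 3]})
print(recall_gain(reco, ref_recos, interactions, k=2, k_ref=1))
# item 1 is excluded as a reference hit; 1 of the 2 remaining positives is hit -> 0.5
```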
The method does not work even with the example from the documentation :(
!pip install RecTools
from rectools.metrics.ranking import MAP
from rectools import Columns
import pandas as pd
Columns.Item = 'movie_id'
Columns.User = 'user_id'
Columns.Rank = 'rank'
reco = pd.DataFrame(
{
Columns.User: [1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4],
Columns.Item: [7, 8, 1, 2, 1, 2, 3, 4, 1, 2, 3],
Columns.Rank: [1, 2, 1, 2, 1, 2, 3, 4, 1, 2, 3],
}
)
interactions = pd.DataFrame(
{
Columns.User: [1, 1, 2, 3, 3, 3, 4, 4, 4],
Columns.Item: [1, 2, 1, 1, 3, 4, 1, 2, 3],
}
)
MAP(k=3).calc_per_user(reco, interactions)
File /opt/conda/lib/python3.10/site-packages/rectools/metrics/ranking.py:247, in MAP.calc_per_user(self, reco, interactions)
245 self._check(reco, interactions=interactions)
246 merged_reco = merge_reco(reco, interactions)
--> 247 fitted = self.fit(merged_reco, k_max=self.k)
248 return self.calc_per_user_from_fitted(fitted)
File /opt/conda/lib/python3.10/site-packages/rectools/metrics/ranking.py:192, in MAP.fit(cls, merged, k_max)
189 prec_at_k_csr = sparse.csr_matrix(np.array([]).reshape(0, 0))
190 return MAPFitted(prec_at_k_csr, users, np.array([]))
--> 192 n_relevant_items = merged.groupby(Columns.User, sort=False)[Columns.Item].agg("size")[users].values
194 user_to_idx_map = pd.Series(np.arange(users.size), index=users)
195 df_prepared = merged.query(f"{Columns.Rank} <= @k_max")
KeyError: 'Column not found: movie_id'
Realisation of SLIM: Sparse Linear Methods for Top-N Recommender Systems
Strong collaborative filtering baseline
We need to fix k+1 to k in the sum. We also need to add that the formula is correct for one user; for multiple users we need to average the results.
The doc is not 100% correct.
Right now DSSMModel has one parameter without a default value: dataset_type: TorchDataset[tp.Any]. This is very confusing since DSSMModel also has a default model, but the user can't use it out of the box.
DSSMModel doesn't follow the simple interface of other RecTools models.
Add an r_precision: bool = False parameter to the Precision metric.
When r_precision is set to True, the number of true positives in the user's top-K recommendations is divided by min(K, test items count for the user) instead of simply dividing by K.
This variant has a better interpretation. https://www.shaped.ai/blog/evaluating-recommendation-systems-part-1
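A sketch of the proposed behaviour on toy data (the `precision_at_k` helper and the plain `user_id`/`item_id`/`rank` column names are illustrative, not the rectools API):

```python
import pandas as pd

def precision_at_k(reco, interactions, k, r_precision=False):
    """Per-user precision@k; with r_precision=True the denominator becomes
    min(k, number of the user's test items) instead of plain k."""
    top = reco[reco["rank"] <= k]
    tp = (top.merge(interactions, on=["user_id", "item_id"])
             .groupby("user_id").size())
    n_test = interactions.groupby("user_id").size()
    denom = n_test.clip(upper=k) if r_precision else k
    return tp.reindex(n_test.index, fill_value=0) / denom

# User 1 has only one test item, so precision@3 can never exceed 1/3 ...
reco = pd.DataFrame({"user_id": [1, 1, 1], "item_id": [1, 2, 3], "rank": [1, 2, 3]})
interactions = pd.DataFrame({"user_id": [1], "item_id": [1]})
print(precision_at_k(reco, interactions, k=3).loc[1])                    # -> 1/3
# ... while R-Precision divides by min(3, 1) = 1, giving a perfect score
print(precision_at_k(reco, interactions, k=3, r_precision=True).loc[1])  # -> 1.0
```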
Realisation of Embarrassingly Shallow Autoencoders for Sparse Data
Strong baseline collaborative filtering model for interactions-only cases
Allow use of pandas >= 2.0 and torch >= 2.0
We need to measure how many items are actually recommended for users at top-k positions, out of all possible when Completeness = 100% (every user has all K items recommended).
Not all models recommend the full lists that were requested. It is necessary to easily find cases with Completeness less than 100%.
Other names might be: Filling, Delivery, Fulfillment, Sufficiency or something else. But not Coverage: Coverage is mostly used for other cases.
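Whatever name wins, the quantity itself is simple. A sketch assuming the standard reco frame, with plain `user_id` and `rank` columns standing in for rectools.Columns:

```python
import pandas as pd

def completeness(reco: pd.DataFrame, k: int) -> float:
    """Share of actually delivered top-k positions out of n_users * k."""
    delivered = (reco["rank"] <= k).sum()
    n_users = reco["user_id"].nunique()
    return delivered / (n_users * k)

reco = pd.DataFrame({
    "user_id": [1, 1, 1, 2],          # user 2 got only one item instead of three
    "item_id": [10, 11, 12, 10],
    "rank":    [1, 2, 3, 1],
})
print(completeness(reco, k=3))  # 4 delivered positions of 6 possible
```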
Good afternoon!
In version 0.3.0 installed via pip it is impossible to import KFoldSplitter.
(test) grigoriy@T430:~/Repos/Book_Crossing_RecSys$ python3
Python 3.8.16 (default, Mar 2 2023, 03:21:46)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from rectools.model_selection import KFoldSplitter
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: cannot import name 'KFoldSplitter' from 'rectools.model_selection' (/home/grigoriy/anaconda3/envs/test/lib/python3.8/site-packages/rectools/model_selection/__init__.py)
>>>
If you check the contents of the model_selection folder in the environment, some files are missing:
ls ~/anaconda3/envs/test/lib/python3.8/site-packages/rectools/model_selection
__init__.py __pycache__ time_split.py
To reproduce the problem:
conda create -n test python=3.8
conda activate test
pip3 install rectools
python3
from rectools.model_selection import KFoldSplitter
Python version - 3.8.16
Other dependencies:
Package Version
------------------ ---------
attrs 21.4.0
certifi 2022.12.7
charset-normalizer 3.1.0
idna 3.4
implicit 0.4.4
joblib 1.2.0
lightfm 1.17
Markdown 3.2.2
nmslib 2.1.1
numpy 1.24.3
pandas 1.5.3
pip 23.0.1
psutil 5.9.5
pybind11 2.6.1
python-dateutil 2.8.2
pytz 2023.3
rectools 0.3.0
requests 2.29.0
scikit-learn 1.2.2
scipy 1.10.1
setuptools 66.0.0
six 1.16.0
threadpoolctl 3.1.0
tqdm 4.65.0
typeguard 2.13.3
urllib3 1.26.15
wheel 0.38.4
pip install rectools and pip install rectools[extension-name] installed fine,
but pip install rectools[all] produced the errors below.
SDK and Visual Studio are installed!
opt = self.warn_dash_deprecation(opt, section)
running bdist_wheel
running build
running build_ext
Extra compilation arguments: ['/EHsc', '/openmp', '/O2', '/DVERSION_INFO=\\"2.1.1\\"']
building 'nmslib' extension
creating build
creating build\temp.win-amd64-cpython-310
creating build\temp.win-amd64-cpython-310\Release
creating build\temp.win-amd64-cpython-310\Release\similarity_search
creating build\temp.win-amd64-cpython-310\Release\similarity_search\src
creating build\temp.win-amd64-cpython-310\Release\similarity_search\src\method
creating build\temp.win-amd64-cpython-310\Release\similarity_search\src\space
creating build\temp.win-amd64-cpython-310\Release\tensorflow
"C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.37.32822\bin\HostX86\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -I.\similarity_search\include -Itensorflow -Ic:\temp\pip-install-b171phqx\nmslib_dde37a94bd0b4873a19978ddf40d8a69\.eggs\pybind11-2.6.1-py3.10.egg\pybind11\include -Ic:\temp\pip-install-b171phqx\nmslib_dde37a94bd0b4873a19978ddf40d8a69\.eggs\pybind11-2.6.1-py3.10.egg\pybind11\include -Ic:\temp\pip-install-b171phqx\nmslib_dde37a94bd0b4873a19978ddf40d8a69\.eggs\pybind11-2.6.1-py3.10.egg\pybind11\include -Ic:\temp\pip-install-b171phqx\nmslib_dde37a94bd0b4873a19978ddf40d8a69\.eggs\pybind11-2.6.1-py3.10.egg\pybind11\include -IC:\Users\TDL\mambaforge\envs\karp\lib\site-packages\numpy\core\include -IC:\Users\TDL\mambaforge\envs\karp\include -IC:\Users\TDL\mambaforge\envs\karp\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.37.32822\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\VS\include" /EHsc /Tp.\similarity_search\src\distcomp_bregman.cc /Fobuild\temp.win-amd64-cpython-310\Release\.\similarity_search\src\distcomp_bregman.obj /EHsc /openmp /O2 /DVERSION_INFO=\\\"2.1.1\\\"
distcomp_bregman.cc
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.37.32822\include\yvals.h(21): fatal error C1083: Cannot open include file: crtdbg.h: No such file or directory
error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2022\\BuildTools\\VC\\Tools\\MSVC\\14.37.32822\\bin\\HostX86\\x64\\cl.exe' failed with exit code 2
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for nmslib
Running setup.py clean for nmslib
Failed to build nmslib
ERROR: Could not build wheels for nmslib, which is required to install pyproject.toml-based projects
Add the ability to recommend for cold users and items to models:
PopularModel
PopularInCategoryModel
RandomModel