rucaibox / recbole-cdr
This is a library built upon RecBole for cross-domain recommendation algorithms
License: MIT License
Hi developers,
I am a new user of RecBole-CDR. I followed the tutorial to reproduce the DTCDR method, but got an error.
I changed eval_args.mode from full to uni999, and I guess this change causes the error. Could you tell me how to fix it?
Here is the overall config:
# general
gpu_id: 0
use_gpu: True
seed: 2022
state: INFO
reproducibility: True
data_path: 'dataset/'
checkpoint_dir: 'saved'
show_progress: True
save_dataset: False
dataset_save_path: ~
save_dataloaders: False
dataloaders_save_path: ~
log_wandb: False
wandb_project: 'recbole_cdr'
# training settings
train_epochs: ["BOTH:300"]
train_batch_size: 4096
learner: adam
learning_rate: 0.0005 #0.001
neg_sampling:
  uniform: 1
eval_step: 1
stopping_step: 10
clip_grad_norm: ~
# clip_grad_norm: {'max_norm': 5, 'norm_type': 2}
weight_decay: 0.0
loss_decimal_place: 4
require_pow: False
# evaluation settings
eval_args:
  split: {'RS':[0.8,0.1,0.1]}
  split_valid: {'RS':[0.8,0.2]}
  group_by: user
  order: RO
  mode: uni999 # full
repeatable: False
metrics: ["Recall","MRR","NDCG","Hit","Precision"]
topk: [10]
valid_metric: MRR@10
valid_metric_bigger: True
eval_batch_size: 409600
metric_decimal_place: 4
Other configs are listed here:
DTCDR.yaml
embedding_size: 64
base_model: NeuMF
mlp_hidden_size: [64, 64]
dropout_prob: 0.3
alpha: 0.3
dataset config
# dataset config
gpu_id: 0
state: INFO
seed: 2022
field_separator: "\t"
source_domain:
  dataset: AmazonBooks
  data_path: '/data/home/work/projects/RecBole-CDR/recbole_cdr/dataset_example/'
  USER_ID_FIELD: user_id
  ITEM_ID_FIELD: item_id
  RATING_FIELD: rating
  TIME_FIELD: timestamp
  NEG_PREFIX: neg_
  LABEL_FIELD: label
  load_col:
    inter: [user_id, item_id, rating]
  user_inter_num_interval: "[10,inf)"
  item_inter_num_interval: "[10,inf)"
  val_interval:
    rating: "[3,inf)"
  drop_filter_field: True
target_domain:
  dataset: AmazonMov
  data_path: '/data/home/work/projects/RecBole-CDR/recbole_cdr/dataset_example/'
  USER_ID_FIELD: user_id
  ITEM_ID_FIELD: item_id
  RATING_FIELD: rating
  TIME_FIELD: timestamp
  NEG_PREFIX: neg_
  LABEL_FIELD: label
  load_col:
    inter: [user_id, item_id, rating]
  user_inter_num_interval: "[10,inf)"
  item_inter_num_interval: "[10,inf)"
  val_interval:
    rating: "[3,inf)"
  drop_filter_field: True
Hi, in the quick_start file, I can see this line:
train_data, valid_data, test_data = data_preparation(config, dataset)
test_data is stored as a <recbole.data.dataloader.general_dataloader.FullSortEvalDataLoader> object. I am assuming that test_data holds the user_id, item_id, and rating for the target domain. How can I read this object as a pandas DataFrame to perform my own evaluation?
Describe the bug
I got this error
To Reproduce
Steps to reproduce the behavior:
Create new env
execute:
!pip install recbole==1.0.1
!pip install recbole-cdr
from recbole_cdr.quick_start import run_recbole_cdr
parameter_dict = {
    # dataset info
    'source_domain': {
        'dataset': 'ml-1m',
        'data_path': 'dataset/'},
    'target_domain': {
        'dataset': 'ml-100k',
        'data_path': 'dataset/target/',
        'user_inter_num_interval': '[5,inf)'},
    # other settings
    'train_epochs': ['SOURCE:300', 'TARGET:300', 'OVERLAP:300']
}
run_recbole_cdr(model='EMCDR', config_dict=parameter_dict)
- OS: Windows
- RecBole Version [e.g. 0.1.0]
- Python Version 3.8.9
Hello, when I tried to change the training datasets to DoubanBook and DoubanMovie, I got the following error. (I modified the properties file according to https://github.com/RUCAIBox/RecBole-CDR/blob/main/results/Douban.md.)
<module 'recbole_cdr.data.dataset' from '/data/guzeng/RecBole-CDR/recbole_cdr/data/dataset.py'>
Traceback (most recent call last):
File "run_recbole_cdr.py", line 22, in <module>
run_recbole_cdr(model=args.model, config_file_list=config_file_list)
File "/data/guzeng/RecBole-CDR/recbole_cdr/quick_start/quick_start.py", line 41, in run_recbole_cdr
dataset = create_dataset(config)
File "/data/guzeng/RecBole-CDR/recbole_cdr/data/utils.py", line 72, in create_dataset
dataset = dataset_class(config)
File "/data/guzeng/RecBole-CDR/recbole_cdr/data/dataset.py", line 312, in __init__
self.source_domain_dataset = CrossDomainSingleDataset(source_config, domain='source')
File "/data/guzeng/RecBole-CDR/recbole_cdr/data/dataset.py", line 31, in __init__
super().__init__(config)
File "/data/guzeng/anaconda3/envs/py37/lib/python3.7/site-packages/recbole/data/dataset/dataset.py", line 96, in __init__
self._from_scratch()
File "/data/guzeng/anaconda3/envs/py37/lib/python3.7/site-packages/recbole/data/dataset/dataset.py", line 106, in _from_scratch
self._load_data(self.dataset_name, self.dataset_path)
File "/data/guzeng/anaconda3/envs/py37/lib/python3.7/site-packages/recbole/data/dataset/dataset.py", line 246, in _load_data
self._download()
File "/data/guzeng/anaconda3/envs/py37/lib/python3.7/site-packages/recbole/data/dataset/dataset.py", line 218, in _download
url = self._get_download_url('url')
File "/data/guzeng/anaconda3/envs/py37/lib/python3.7/site-packages/recbole/data/dataset/dataset.py", line 213, in _get_download_url
f'Neither [{self.dataset_path}] exists in the device'
ValueError: Neither [dataset/DoubanBook] exists in the devicenor [DoubanBook] a known dataset name.
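The message above is raised when the dataset directory cannot be found and the name is not a downloadable dataset. A quick way to check the paths before launching a run is sketched below. It assumes the usual RecBole layout, where the interaction file sits at `<data_path>/<dataset>/<dataset>.inter`; `expected_inter_file` and `missing_domains` are hypothetical helper names, and the config values are placeholders.

```python
from pathlib import Path


def expected_inter_file(data_path: str, dataset: str) -> Path:
    """Return the path where the interaction file is assumed to live."""
    return Path(data_path) / dataset / f"{dataset}.inter"


def missing_domains(domains: dict) -> list:
    """List (domain, expected_path) pairs whose .inter file does not exist."""
    missing = []
    for name, cfg in domains.items():
        path = expected_inter_file(cfg["data_path"], cfg["dataset"])
        if not path.is_file():
            missing.append((name, str(path)))
    return missing


if __name__ == "__main__":
    # Placeholder values mirroring the names in the error message.
    config = {
        "source_domain": {"dataset": "DoubanBook", "data_path": "dataset/"},
        "target_domain": {"dataset": "DoubanMovie", "data_path": "dataset/"},
    }
    for name, path in missing_domains(config):
        print(f"{name}: expected file not found at {path}")
```

If a domain is reported as missing, the relative `data_path` is probably being resolved against a different working directory than expected, so an absolute path is worth trying.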
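Does RecBole-CDR have any usage documentation? Thanks.
Hello developers, I am a user coming from RecBole, and some of the CDR features are very appealing to me.
I know the project is still at an early stage of development, but is there a temporary or preliminary manual that outlines the basic usage?
Many thanks.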
Describe the bug
field2id_token stores, for a given feature, the remapped IDs of its entries and their corresponding original tokens. In dataset.py, _remap_fields() of CrossDomainSingleDataset does this with the following line:
self.field2id_token[field_name] = list(map_dict.keys())
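However, map_dict is a ChainMap, so when the tokens are taken with .keys(), their order does not follow the order in which the dicts are stored in the chain. For example, for user_id the order should be overlap_user first and then domain_specific_user, but the code above yields exactly the opposite order. The way tokens are extracted therefore needs to change.
Expected behavior
Change how the keys are extracted by defining get_keys_from_chainmap_by_order(), which returns the ChainMap's keys (the original tokens) in forward order: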
def get_keys_from_chainmap_by_order(map_dict):
    merged_dict = dict()
    for dict_item in map_dict.maps:
        merged_dict.update(dict_item)
    return list(merged_dict.keys())
self.field2id_token[field_name] = get_keys_from_chainmap_by_order(map_dict)
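The ordering behaviour described in this report can be reproduced with the standard library alone. The sketch below uses made-up tokens (`u1`, `u2`, `u9`) and a hypothetical helper `keys_in_chain_order` that mirrors the proposed fix; it is an illustration, not the library's code.

```python
from collections import ChainMap


def keys_in_chain_order(chain: ChainMap) -> list:
    """Return keys following the order of chain.maps (first mapping first)."""
    merged = {}
    for mapping in chain.maps:
        for key in mapping:
            merged.setdefault(key, None)  # keep the first-seen position
    return list(merged)


overlap_users = {"u1": 1, "u2": 2}  # hypothetical overlapped tokens
specific_users = {"u9": 3}          # hypothetical domain-specific tokens
chain = ChainMap(overlap_users, specific_users)

print(list(chain.keys()))          # CPython 3.7+: ['u9', 'u1', 'u2']
print(keys_in_chain_order(chain))  # ['u1', 'u2', 'u9']
```

ChainMap builds its iteration order starting from the last mapping, which is why the domain-specific token appears first in the plain `.keys()` call.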
I want to add a feature to the SSCDR model, but I am not able to call the function. I only need to call the function once and then do further mapping. I have tried every way I could think of, but it raises an "index out of bound" error because the call happens inside the iteration.
Please help me out resolving this issue.
Hi, I am a bit new to hyperparameter tuning in Recbole-CDR. I ran the CoNet algorithm on my dataset, and the results seemed to be very poor. For the test dataset I am using, I am getting NDCG@10 values to be around 0.9 on a model that I coded up, so I believe the CoNet values should be around the same range, since CoNet is a strong baseline in Cross-Domain Recommendation. Below are the results I am getting when running CoNet.
INFO test result: OrderedDict([('recall@10', 0.0179), ('mrr@10', 0.0063), ('ndcg@10', 0.0087), ('hit@10', 0.0183), ('precision@10', 0.0018)])
I believe I need to tune the parameters for the model for the numbers to be a lot better. I want to tune the batch size, embedding size, the number of dense layers, the learning rate, and any other parameter that can be tuned. After tuning the hyperparameters, I want to use the best model to make recommendations on the test set.
Please let me know how I can tune the different hyperparameters and use the best model on the test set to collect the metric values.
Right now, I am using the default values that come with the run_recbole_cdr.py file. Below are the default values that I am running CoNet with:
Evaluation Hyper Parameters:
eval_args = {'group_by': 'user', 'order': 'TO', 'split': {'RS': [0.7, 0.2, 0.1]}, 'mode': 'full'}
repeatable = False
metrics = ['Recall', 'MRR', 'NDCG', 'Hit', 'Precision']
topk = [10]
valid_metric = MRR@10
valid_metric_bigger = True
eval_batch_size = 4096
metric_decimal_place = 4
Other Hyper Parameters:
wandb_project = recbole_cdr
train_epochs = ['BOTH:300']
require_pow = False
embedding_size = 64
reg_weight = 0.01
mlp_hidden_size = [64, 32, 16, 8]
MODEL_TYPE = ModelType.CROSSDOMAIN
MODEL_INPUT_TYPE = InputType.POINTWISE
eval_type = EvaluatorType.RANKING
train_modes = ['BOTH']
epoch_num = ['300']
source_split = False
device = cuda
train_neg_sample_args = {'strategy': 'by', 'by': 1, 'distribution': 'uniform', 'dynamic': 'none'}
eval_neg_sample_args = {'strategy': 'full', 'distribution': 'uniform'}
Source domain: ./comedy_data/comedy
The number of users: 2217
Average actions of users: 16.08528880866426
The number of items: 4977
Average actions of items: 7.16338424437299
The number of inters: 35645
The sparsity of the dataset: 99.6769533176926%
Remain Fields: ['source_user_id', 'source_item_id', 'source_rating', 'source_timestamp']
Target domain: ./action_data/action
The number of users: 2217
Average actions of users: 19.935469314079423
The number of items: 2927
Average actions of items: 15.098086124401913
The number of inters: 44177
The sparsity of the dataset: 99.31921840719268%
Remain Fields: ['target_user_id', 'target_item_id', 'target_rating', 'target_timestamp']
Num of overlapped user: 2217
Num of overlapped item: 1
INFO [Training]: train_batch_size = [2048] negative sampling: [{'uniform': 1}]
INFO [Evaluation]: eval_batch_size = [4096] eval_args: [{'group_by': 'user', 'order': 'TO', 'split': {'RS': [0.7, 0.2, 0.1]}, 'mode': 'full'}]
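For the tuning question above, one simple starting point is an exhaustive grid search over a few config keys. The sketch below is library-agnostic: `run_once` is a hypothetical callable that would wrap one training run (for instance a call to `run_recbole_cdr` with a `config_dict`) and return the validation score, and what that call actually returns should be checked before relying on it. The candidate values are arbitrary examples.

```python
from itertools import product


def grid_search(run_once, search_space, bigger_is_better=True):
    """Try every combination in search_space and keep the best one.

    run_once: callable taking a config dict and returning a validation score
              (hypothetical; it would wrap one training run).
    search_space: dict mapping parameter name -> list of candidate values.
    """
    names = list(search_space)
    best_score, best_params = None, None
    for values in product(*(search_space[n] for n in names)):
        params = dict(zip(names, values))
        score = run_once(params)
        better = (
            best_score is None
            or (score > best_score if bigger_is_better else score < best_score)
        )
        if better:
            best_score, best_params = score, params
    return best_params, best_score


if __name__ == "__main__":
    # Candidate values are examples only, not recommendations.
    space = {
        "learning_rate": [0.0005, 0.001, 0.005],
        "embedding_size": [32, 64],
        "train_batch_size": [1024, 2048],
    }

    def fake_run(params):
        # Stand-in objective so the sketch runs without training anything.
        return -abs(params["learning_rate"] - 0.001) + params["embedding_size"] / 1000

    print(grid_search(fake_run, space))
```

Selecting on the validation metric and evaluating the winning configuration once on the test set keeps the reported test numbers free of tuning bias.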
Hi, I am currently working on a project where I compare different algorithms to see whether the differences in their results are statistically significant. How can I get the HR, NDCG, and MRR values for each user instead of a single aggregated value?
For example, if my dataset has 10 users in the test set, how can I retrieve the 10 per-user MRR, NDCG, and HR values instead of an average? Right now, the model returns "INFO test result: OrderedDict([('recall@10', 0.0233), ('mrr@10', 0.0098), ('ndcg@10', 0.0125), ('hit@10', 0.025), ('precision@10', 0.0025)])". But I want each metric for every user in the dataset.
Please let me know how to get that.
Thank you!
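If the top-k list and the ground-truth items can be obtained for each user, the per-user values are straightforward to compute outside the library. The sketch below is a minimal, self-contained version with binary relevance; `per_user_metrics` is a hypothetical helper and the rankings are made up. How to extract the ranked lists from RecBole-CDR itself is not covered here.

```python
import math


def per_user_metrics(ranked_items, relevant_items, k=10):
    """Compute HR@k, MRR@k and NDCG@k for a single user.

    ranked_items: item ids ordered by predicted score, best first.
    relevant_items: set of ground-truth item ids for this user.
    """
    top_k = ranked_items[:k]
    hits = [1 if item in relevant_items else 0 for item in top_k]

    hit = 1.0 if any(hits) else 0.0
    # Reciprocal rank of the first relevant item, 0 if none is in the top k.
    mrr = next((1.0 / (rank + 1) for rank, h in enumerate(hits) if h), 0.0)
    dcg = sum(h / math.log2(rank + 2) for rank, h in enumerate(hits))
    # Ideal DCG uses min(k, number of relevant items) positions, so a user
    # with fewer than k test items can still reach NDCG = 1.
    ideal_len = min(k, len(relevant_items))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_len))
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return {"hit": hit, "mrr": mrr, "ndcg": ndcg}


# Hypothetical rankings and ground truth for two users.
rankings = {"u1": ["i3", "i1", "i7"], "u2": ["i2", "i9", "i4"]}
truth = {"u1": {"i1"}, "u2": {"i5"}}
per_user = {u: per_user_metrics(rankings[u], truth[u], k=3) for u in rankings}
print(per_user)
```

The resulting per-user values can then be fed into a paired significance test across two algorithms evaluated on the same users.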
Is your feature request related to a problem? Please describe.
Current RecBole-CDR only supports single-target CDR models. Will you add support for multi-target CDR problem mentioned in this survey in a future release? Multi-target CDR is to improve the recommendation accuracy in all domains simultaneously rather than just one target domain.
Describe the solution you'd like
New dataset, dataloader and trainer supporting multi-target CDR.
Describe alternatives you've considered
I found this part in dataset.py, but I still don't know how to set source_split_flag and change single-target CDR to multi-target CDR.
RecBole-CDR/recbole_cdr/data/dataset.py
Lines 558 to 567 in 1758db4
Hi, I am confused about the user ID preprocessing process:
For cross-domain recommendation (users partially overlap):
Source user id : [1,2,3,4], target user id : [1,2,3,5], the id list is shared.
How can I run conet.py with such IDs? I am very confused about the following lines:
self.source_user_embedding.weight[self.overlapped_num_users: self.target_num_users].fill_(0)
self.source_item_embedding.weight[self.overlapped_num_items: self.target_num_items].fill_(0)
self.target_user_embedding.weight[self.target_num_users:].fill_(0)
self.target_item_embedding.weight[self.target_num_items:].fill_(0)
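The slices in these four lines suggest that all users share one index space laid out in blocks: overlapped users first, then target-only users, then source-only users. The sketch below rebuilds that layout for the example IDs in this question. The layout is inferred from the slicing rather than copied from the library, index 0 is assumed to be reserved for padding as in RecBole, and `build_shared_ids` is a hypothetical helper.

```python
def build_shared_ids(source_users, target_users):
    """Remap raw user ids into one index space.

    Assumed layout (inferred from the slices above, not copied from the
    library): [0] padding, then overlapped users, then target-only users,
    then source-only users.
    """
    source_set, target_set = set(source_users), set(target_users)
    overlapped = sorted(source_set & target_set)
    target_only = sorted(target_set - source_set)
    source_only = sorted(source_set - target_set)

    ordered = ["[PAD]"] + overlapped + target_only + source_only
    token2id = {token: idx for idx, token in enumerate(ordered)}
    overlapped_num = 1 + len(overlapped)            # end of the overlapped block
    target_num = overlapped_num + len(target_only)  # end of the target block
    return token2id, overlapped_num, target_num, len(ordered)


token2id, overlapped_num, target_num, total_num = build_shared_ids(
    [1, 2, 3, 4], [1, 2, 3, 5]
)
print(token2id)  # {'[PAD]': 0, 1: 1, 2: 2, 3: 3, 5: 4, 4: 5}
# Rows a source-side table never trains: target-only users.
print(list(range(overlapped_num, target_num)))  # [4]
# Rows a target-side table never trains: source-only users.
print(list(range(target_num, total_num)))       # [5]
```

Under this reading, the `fill_(0)` calls zero the embedding rows of users that never appear in the corresponding domain, so raw IDs do not need to be aligned by hand as long as the same token denotes the same user in both files.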
Hi, I downloaded the Amazon dataset from here: https://recbole.s3-accelerate.amazonaws.com/CrossDomain/Amazon.zip
The dataset statistics that you report here do not match what I compute from the original data.
I removed all rows with NaNs and computed the number of unique values in the user_id column of the original .inter files. This gives the following statistics:
Number of users in AmazonBooks: 687827
Number of users in AmazonMov: 66317
Number of overlapping users: 27516
Am I doing something wrong?
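For reference, the raw counts described above can be computed along these lines. The sketch assumes tab-separated .inter files whose header uses RecBole's `name:type` form (for example `user_id:token`); `unique_users` is a hypothetical helper, and the two in-memory files are tiny stand-ins for the real data.

```python
import csv
import io


def unique_users(lines, user_column="user_id"):
    """Collect distinct user ids from tab-separated .inter content.

    The header is assumed to use the `name:type` form, for example
    `user_id:token`; rows with an empty user id are skipped.
    """
    reader = csv.reader(lines, delimiter="\t")
    header = [name.split(":")[0] for name in next(reader)]
    col = header.index(user_column)
    return {row[col] for row in reader if len(row) > col and row[col] != ""}


# Tiny in-memory stand-ins for the two .inter files.
books = io.StringIO(
    "user_id:token\titem_id:token\trating:float\nA\tb1\t5\nB\tb2\t4\n"
)
movies = io.StringIO(
    "user_id:token\titem_id:token\trating:float\nB\tm1\t3\nC\tm2\t5\n"
)

book_users = unique_users(books)
movie_users = unique_users(movies)
print(len(book_users), len(movie_users), len(book_users & movie_users))  # 2 2 1
```

These are counts over the unfiltered files. If the published statistics were computed after filters such as user_inter_num_interval or val_interval, the numbers would differ; the thread does not state whether that is the case.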
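Is your feature request related to a problem?
How can I add .kg, .link, or .net files?
I wrote the YAML for a RecBole-CDR model following the models implemented in RecBole, but ent_id cannot be read.
Describe the solution you'd like
Could you provide an example showing how to load the atomic files (such as .kg, .link, and .ent_feature) separately for the source and target domains, and how to use preload_weight?
After getting the model running, I found that the training epoch is only 1803 while the validation epoch is 466232. Is this normal? It seems counterintuitive to me. Also, the data split looks like 8:1:1, but once the program runs, it feels as if the training and validation data loaders have been swapped. I have barely changed the code. What is the problem?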
Hi, I am trying to calculate the NDCG@10, MRR@10, and HR@10 on my dataset. My dataset is quite small, and some users may not rate up to 10 items in the test set. Usually, we look at the top 10 predictions and check to see if the predictions are in the ground truth for the given user. But, if the user in the test set does not rate that many items, how does RecBole-CDR handle this case?
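First of all, thank you for open-sourcing such a powerful library. Your code implements the top-N task. I would like to ask whether the top-N task could be changed into a "高度预测" (literally, "height prediction") task. That would be even better.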
Traceback (most recent call last):
File "/home/baoyanghao/.pycharm_helpers/pydev/pydevd.py", line 1438, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
File "/home/baoyanghao/.pycharm_helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/home/baoyanghao/ss/mycode/RecBole-CDR/run_recbole_cdr.py", line 18, in <module>
run_recbole_cdr(model=args.model, config_file_list=config_file_list)
File "/home/baoyanghao/ss/mycode/RecBole-CDR/recbole_cdr/quick_start/quick_start.py", line 43, in run_recbole_cdr
train_data, valid_data, test_data = data_preparation(config, dataset)
File "/home/baoyanghao/ss/mycode/RecBole-CDR/recbole_cdr/data/utils.py", line 87, in data_preparation
dataloaders = load_split_dataloaders(config)
File "/home/baoyanghao/anaconda3/envs/pytorch17/lib/python3.7/site-packages/recbole/data/utils.py", line 78, in load_split_dataloaders
with open(saved_dataloaders_file, 'rb') as f:
TypeError: expected str, bytes or os.PathLike object, not CDRConfig
Process finished with exit code 1
Hi, I'd like to use 2 custom datasets for the source and target domain. I understand that RecBole-CDR builds off of the existing RecBole library, and I was able to use a custom dataset in RecBole to run general recommendation algorithms.
How can I use custom datasets with RecBole-CDR? Specifically, how do I specify the source and target domains and choose the model I want to run?
I want the target domain dataset to be split in order without shuffling. So when I run the algorithm CoNet, for example using the different source domains but the same target domain, I want the train, valid, and test set for the target domain to be the same through the multiple runs. For example, let's say I have three datasets. I make dataset 1 the target domain, and dataset 2 and dataset 3 as the source domains. When I run CoNet on the domain pair of dataset 2 and dataset 1, I want the train, valid, and test set for dataset 1 to be the same as when I run CoNet on the domain pair of dataset 3 and dataset 1. How can I achieve this?
My current YAML file is below. Is this the correct way to do this, or do I have to add anything else?
source_domain:
seed: 44
gpu_id: "0"
dataset: '../source/data'
USER_ID_FIELD: user_id
ITEM_ID_FIELD: item_id
TIME_FIELD: timestamp
RATING_FIELD: rating
load_col:
inter: [user_id, item_id, rating, timestamp]
embedding_size: 64
user_inter_num_interval: "[0,inf)"
item_inter_num_interval: "[0,inf)"
val_interval:
rating: "[0,inf)"
target_domain:
seed: 44
gpu_id: "0"
dataset: '../target/data'
USER_ID_FIELD: user_id
ITEM_ID_FIELD: item_id
RATING_FIELD: rating
TIME_FIELD: timestamp
eval_args:
group_by: user
order: TO
split: {'RS': [0.7,0.2,0.1]}
mode: full
load_col:
inter: [user_id, item_id, rating, timestamp]
embedding_size: 64
user_inter_num_interval: "[0,inf)"
item_inter_num_interval: "[0,inf)"
val_interval:
rating: "[0,inf)"
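One way to reason about this request: if each user's interactions are sorted by timestamp and cut at fixed ratios, the split depends only on the target domain's own rows, so it stays the same whichever source domain is paired with it. The sketch below illustrates that idea in plain Python. `split_by_user_in_order` is a hypothetical helper, not RecBole-CDR's splitting code, and it assumes the target interactions are not filtered differently across runs.

```python
from collections import defaultdict


def split_by_user_in_order(interactions, ratios=(0.7, 0.2, 0.1)):
    """Split (user, item, timestamp) rows per user by time, without shuffling.

    This is a simplified stand-in for an ordered ratio split, not the
    library's own splitting code; rounding at the boundaries may differ.
    """
    by_user = defaultdict(list)
    for user, item, ts in interactions:
        by_user[user].append((ts, item))

    train, valid, test = [], [], []
    for user in sorted(by_user):
        rows = sorted(by_user[user])  # oldest first; ties broken by item id
        n = len(rows)
        n_train = int(n * ratios[0])
        n_valid = int(n * ratios[1])
        train += [(user, item) for _, item in rows[:n_train]]
        valid += [(user, item) for _, item in rows[n_train:n_train + n_valid]]
        test += [(user, item) for _, item in rows[n_train + n_valid:]]
    return train, valid, test


# Ten hypothetical interactions of one user, given in scrambled order.
rows = [("u1", f"i{t}", t) for t in (3, 0, 9, 1, 7, 2, 8, 4, 6, 5)]
first = split_by_user_in_order(rows)
second = split_by_user_in_order(list(reversed(rows)))
print(first == second)  # True: the result does not depend on input order
print(first[2])         # [('u1', 'i9')]
```

Comparing the target-domain train, valid, and test sets of two runs in this way is a direct check that the configuration gives the intended behaviour.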
I would like to run all the 10 CDR algorithms on the Amazon and Douban datasets.
I see that you provided the hyperparameters for each model, and I'd like to use them to get results.
Is it as simple as running:
python run_recbole_cdr.py --model=[model] --dataset=Amazon
If not, how can I specify the dataset and run the models with the same hyperparameter configurations that you mentioned?
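Hello, I am a user who wants to use RecBole-CDR for a cross-domain CTR task. I need AUC and LogLoss as output metrics, but both turn out to be very poor. I am looking for advice on adjusting the parameters or the model.
My tests use the two datasets under recbole_cdr/dataset_example (source: ml-1m, target: ml-100k), with labels filtered by threshold=4. Whichever base model I use, the output AUC is around 0.6. However, single-domain model code from elsewhere (I tested DeepFM) reaches AUC > 0.75 on the same target dataset.
I have adjusted some hyperparameters (such as xx_xx_num_interval, the learning rate, valid_metric, and even threshold=3), but there was no noticeable improvement.
Below are the RecBole-CDR model parameters I used, for your reference:
1. Parameter file sample.yaml: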
# dataset config
gpu_id: 0
state: INFO
field_separator: "\t"
use_gpu: True
seed: 2000
reproducibility: True
data_path: 'dataset/'
checkpoint_dir: 'saved'
show_progress: True
save_dataset: False
dataset_save_path: ~
save_dataloaders: False
dataloaders_save_path: ~
log_wandb: False
wandb_project: 'recbole_cdr'
normalize_all: True
# training settings
train_epochs: ["BOTH:300"]
train_batch_size: 2048
learner: adam
neg_sampling:
  uniform: 1
eval_step: 1
stopping_step: 10
clip_grad_norm: ~
weight_decay: 1e-3
loss_decimal_place: 6
require_pow: False
# evaluation settings
eval_args:
  split: {'RS':[0.8,0.1,0.1]}
  group_by: None
  mode: labeled
repeatable: False
metrics: ['AUC', 'LogLoss']
valid_metric: AUC
valid_metric_bigger: True
eval_batch_size: 2048
metric_decimal_place: 6
source_domain:
  dataset: ml-1m
  data_path: 'dataset/'
  seq_separator: " "
  USER_ID_FIELD: user_id
  ITEM_ID_FIELD: item_id
  RATING_FIELD: rating
  TIME_FIELD: timestamp
  NEG_PREFIX: neg_
  LABEL_FIELD: label
  threshold:
    rating: 4
  load_col:
    inter: [user_id, item_id, rating]
  user_inter_num_interval: "[5,inf)"
  item_inter_num_interval: "[5,inf)"
  val_interval:
    rating: "[3,inf)"
  drop_filter_field: True
target_domain:
  dataset: ml-100k
  data_path: 'dataset/'
  seq_separator: ","
  USER_ID_FIELD: user_id
  ITEM_ID_FIELD: item_id
  RATING_FIELD: rating
  TIME_FIELD: timestamp
  NEG_PREFIX: neg_
  LABEL_FIELD: label
  threshold:
    rating: 4
  load_col:
    inter: [user_id, item_id, rating]
  user_inter_num_interval: "[5,inf)"
  item_inter_num_interval: "[5,inf)"
  val_interval:
    rating: "[3,inf)"
  drop_filter_field: True
2. Python file:
import argparse
from recbole_cdr.quick_start import run_recbole_cdr
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--model', '-m', type=str, default='DTCDR', help='name of models')
    parser.add_argument('--config_files', type=str, default='sample.yaml', help='config files')
    args, _ = parser.parse_known_args()
    config_file_list = args.config_files.strip().split(' ') if args.config_files else None
    print(config_file_list)
    run_recbole_cdr(model=args.model, config_file_list=config_file_list)
embedding_size: 64
base_model: NeuMF
learning_rate: 0.0005
mlp_hidden_size: [64, 64]
dropout_prob: 0.3
alpha: 0.3
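Thank you for your help!
Hello, in the Bi-TGCF implementation, the class initialization sets these rows to 0 (presumably for users that are not in the source/target domain?). I am not sure whether this is necessary.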
RecBole-CDR/recbole_cdr/model/cross_domain_recommender/bitgcf.py
Lines 59 to 64 in d339918
This is because a parameter initialization follows immediately afterwards:
Hi, I am running the CoNet algorithm on two datasets. On some datasets, the algorithm is outputting results, and is working fine. But, on some other cases, I am getting this error:
File "/home/akrish/test-env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1604, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for CoNet:
size mismatch for source_user_embedding.weight: copying a param with shape torch.Size([7572, 64]) from checkpoint, the shape in current model is torch.Size([8649, 64]).
size mismatch for target_user_embedding.weight: copying a param with shape torch.Size([7572, 64]) from checkpoint, the shape in current model is torch.Size([8649, 64]).
size mismatch for source_item_embedding.weight: copying a param with shape torch.Size([6843, 64]) from checkpoint, the shape in current model is torch.Size([4222, 64]).
size mismatch for target_item_embedding.weight: copying a param with shape torch.Size([6843, 64]) from checkpoint, the shape in current model is torch.Size([4222, 64]).
I am also getting this same error when running the DTCDR, CMF, and CLMF algorithms. I am also using a GPU, so I don't know if that may cause an issue.
My YAML file looks like this, where dataset points to the .inter files for each domain:
source_domain:
seed: 44
gpu_id: "0"
dataset: '/home/akrish/fall_2022/dataframes/action_data/action'
USER_ID_FIELD: user_id
ITEM_ID_FIELD: item_id
RATING_FIELD: rating
load_col:
inter: [user_id, item_id, rating]
user_inter_num_interval: "[0,inf)"
item_inter_num_interval: "[0,inf)"
val_interval:
rating: "[0,inf)"
target_domain:
seed: 44
gpu_id: "0"
dataset: '/home/akrish/fall_2022/dataframes/adventure_data/adventure'
USER_ID_FIELD: user_id
ITEM_ID_FIELD: item_id
RATING_FIELD: rating
load_col:
inter: [user_id, item_id, rating]
user_inter_num_interval: "[0,inf)"
item_inter_num_interval: "[0,inf)"
val_interval:
rating: "[0,inf)"
I would appreciate any assistance on this issue.