
2022-nips-tenrec's People

Contributors

yuangh-x


2022-nips-tenrec's Issues

about dataset

What do the QK and QB platforms refer to? The paper does not seem to explain them in detail. Are they QQ Kandian and QQ Browser?

QK-article.csv not found in TenRec.zip

Hi, thank you for sharing this large dataset. I downloaded TenRec.zip via the link https://drive.google.com/file/d/1R1JhdT9CHzT3qBJODz09pVpHMzShcQ7a/view?usp=sharing. However, when I decompressed the zip file, I got only three files: QB-article.csv, QB-video.csv, and QK-video.csv. QK-article.csv was not there. I am not sure whether I made a mistake or the file is missing from the zip package. Could you help me check? Much appreciated.

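Questions about bert4rec

The negative samples and evaluation for the bert4rec dev set differ somewhat from the PyTorch version I have looked at.
1. Number of negative samples: is evaluation run against the full item set, or are negative samples not used in evaluation at all?
I can see some code for candidate sets, but it does not appear to be used.
2. Model prediction
Is there any code for model prediction in this part?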

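Question about the train/test data split

Normally, the training and test sets are split by the time at which behaviors occur, for example using data from day 1 for training and data from day 2 for testing.
In the code, however, the ctr task uses a random split, and the sequential recommendation task takes each user's second-to-last behavior as the valid set and the last behavior as the test set. This is not strictly ordered by the time at which behaviors occurred.
Is this way of splitting a problem?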
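
To make the concern concrete, here is a minimal sketch contrasting the per-user leave-one-out split described above with a global time-based split. The `(user, item, timestamp)` tuples, the function names, and the `cutoff` parameter are all hypothetical and are not taken from the repository.

```python
# Hypothetical toy data: (user, item, timestamp). Not the Tenrec schema.
interactions = [
    ("u1", "a", 1), ("u1", "b", 2), ("u1", "c", 3),
    ("u2", "d", 4), ("u2", "e", 5), ("u2", "f", 6),
]

def leave_one_out(rows):
    """Per user: last item -> test, second-to-last -> valid, rest -> train."""
    by_user = {}
    for user, item, _ts in sorted(rows, key=lambda r: r[2]):
        by_user.setdefault(user, []).append(item)
    train, valid, test = {}, {}, {}
    for user, seq in by_user.items():
        train[user], valid[user], test[user] = seq[:-2], seq[-2:-1], seq[-1:]
    return train, valid, test

def global_time_split(rows, cutoff):
    """Rows strictly before `cutoff` go to train; the rest go to test."""
    train = [r for r in rows if r[2] < cutoff]
    test = [r for r in rows if r[2] >= cutoff]
    return train, test

loo_train, loo_valid, loo_test = leave_one_out(interactions)
time_train, time_test = global_time_split(interactions, cutoff=4)

print(loo_train, loo_valid, loo_test)
print(time_train, time_test)
```

In this toy data, leave-one-out places u2's item "d" (timestamp 4) in the training set even though it occurs after u1's test item "c" (timestamp 3). The global time split avoids that ordering issue but leaves u2 with no training history at all.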

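What do the gender values represent?

Hello, in files such as QK-video.csv, the gender attribute seems to contain only the integer values 0-2. What does each value represent?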

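Question about data processing in the BERT4Rec part

In the main function, train_val_test_split splits the original dataset into train_data, val_data, and test_data. train_data takes [: -2] of the sequence, val_data takes [-2: -1], and test_data takes [-1: ]. During validation, [: -2] is therefore the feature and [-2: -1] is the ground truth. In that case, should seq = self.u2seq[user][:-1] in the Build_full_EvalDataset function (utils.py, line 1158) instead be seq = self.u2seq[user][:]?

Also, during testing, shouldn't [: -1] of the sequence be the feature and [-1: ] the ground truth? That is, when test_dataset is built in main.py (line 90), the first argument to Build_full_EvalDataset should be [: -1] of the original sequence, not just [: -2].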
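
The slicing in question can be checked on a toy sequence. This sketch only illustrates Python slice semantics under the reading proposed in this issue. It does not reproduce what Build_full_EvalDataset actually does, and the item IDs and variable names are hypothetical.

```python
seq = [10, 11, 12, 13, 14]  # hypothetical item IDs in chronological order

train_data = seq[:-2]   # all but the last two items
val_data = seq[-2:-1]   # the second-to-last item
test_data = seq[-1:]    # the last item

# Slicing the training portion again with [:-1] drops one more item,
# so the item just before the validation target is no longer in the input.
over_trimmed = train_data[:-1]

# Inputs and targets under the reading proposed in the issue:
val_input, val_target = seq[:-2], seq[-2:-1]
test_input, test_target = seq[:-1], seq[-1:]

print(train_data, val_data, test_data)
print(over_trimmed)
print(val_input, val_target, test_input, test_target)
```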

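I cannot find the pctcvr computation in the esmm multi-task method

Hello, I have recently been studying multi-task recommendation. As I understand it, the esmm method computes pctcvr as pcvr*pctr and trains the network with losses on pctr and pctcvr. In the esmm model here, however, I can only see pctr and pcvr being computed. pctcvr does not seem to be computed anywhere?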
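
For reference, the ESMM objective as formulated in the original ESMM paper can be sketched in a few lines: the CTR head is supervised with the click label, and the product pctr * pcvr is supervised with the click-and-convert label. This is a plain-Python illustration, not the repository's implementation, and the function and argument names are hypothetical.

```python
import math

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one prediction p and one label y."""
    p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def esmm_loss(p_ctr, p_cvr, click, conversion):
    """ESMM loss for one impression.

    pCTCVR is the product of the two tower outputs. The CVR tower gets
    no direct loss; it is trained only through the pCTCVR term.
    """
    p_ctcvr = p_ctr * p_cvr
    return bce(p_ctr, click) + bce(p_ctcvr, click * conversion)

loss = esmm_loss(p_ctr=0.5, p_cvr=0.5, click=1, conversion=1)
print(loss)
```

If the code applies a loss directly to pcvr rather than to the product, it is training a different objective from the one described in this issue.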

The number of interactions in ctr_data_1M.csv differs from the number reported in the paper

Hi, author! Thank you for your contribution!
I have a question: why does the number of interactions in ctr_data_1M.csv differ from the number reported in the paper? I loaded ctr_data_1M.csv and found 120342306 interactions, but the paper reports 86642580. Do I need to filter some interactions? If so, what filtering strategy did you use? I look forward to your reply!

timestamp

Why can't I find the Tenrec timestamp described in the paper in the CSV files?

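Why does Session-based Recommendation testing take the mean of scores?

In Session-based Recommendation:
Suppose the seq input is (v1,v2,v3,...,vT) and the model output tensor has shape (B, T, V), where V is the total number of items.
During training, the output at each time step can be interpreted as the predicted probability of the next item, and CE loss is used as the loss.
During testing, however, the outputs of all time steps are averaged (the corresponding code is scores = scores.mean(1) at line 502 of trainer.py) and used as the probability, instead of taking the last time step (score[:,-1]) as the output. Isn't this inconsistent with training?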
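
A small example shows how the two readouts can rank items differently. The score values below are made up, and nested lists stand in for a (B, T, V) tensor with B=1, T=3, V=3. The helper names are hypothetical.

```python
# scores[b][t][v]: hypothetical model output for batch b, time step t, item v
scores = [
    [
        [0.9, 0.05, 0.05],  # step 1
        [0.8, 0.1, 0.1],    # step 2
        [0.1, 0.2, 0.7],    # step 3 (last)
    ]
]

def mean_over_time(session):
    """Average each item's score across all time steps (like scores.mean(1))."""
    steps = len(session)
    return [sum(step[v] for step in session) / steps
            for v in range(len(session[0]))]

def last_step(session):
    """Keep only the final time step (like scores[:, -1])."""
    return session[-1]

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

session = scores[0]
top_mean = argmax(mean_over_time(session))
top_last = argmax(last_step(session))
print(top_mean, top_last)
```

Here the averaged scores favor item 0, which dominated the early steps, while the last step alone favors item 2.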

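Meaning of the age attribute values

Hello, in files such as QK-video.csv and sbr_data_1M.csv, the age attribute seems to contain only the integer values 0-5. Which age range does each value represent?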

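User gender in the dataset contains errors!

  1. For QK-article and QK-video: QK-article contains users with more than one gender, and the genders of users that appear in both QK-article and QK-video do not match.
  2. QB-article and QB-video have the same problems.
    The following code verifies the problems I describe:
    [image]

[image]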
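
Since the screenshots are not available as text, here is a minimal sketch of how such a check could be written. It is not the issue author's code. Rows are assumed to be `(user_id, gender)` pairs already read from the CSV files, and the sample data and function names are hypothetical.

```python
from collections import defaultdict

def gender_sets(rows):
    """Map each user_id to the set of gender values seen for it."""
    genders = defaultdict(set)
    for user_id, gender in rows:
        genders[user_id].add(gender)
    return genders

def multi_gender_users(rows):
    """Users that appear with more than one gender value in one file."""
    return sorted(u for u, g in gender_sets(rows).items() if len(g) > 1)

def cross_file_mismatches(rows_a, rows_b):
    """Users present in both files whose gender values differ."""
    a, b = gender_sets(rows_a), gender_sets(rows_b)
    return sorted(u for u in a.keys() & b.keys() if a[u] != b[u])

# Hypothetical sample rows standing in for an article file and a video file
article_rows = [(1, 0), (1, 2), (2, 1), (3, 1)]
video_rows = [(2, 1), (3, 2)]

multi = multi_gender_users(article_rows)
mismatch = cross_file_mismatches(article_rows, video_rows)
print(multi, mismatch)
```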

Reproducing the paper's results

I generated data for 100K users with gen_ctr.py and ran the ESMM task.
python main.py --task_name=mtl --seed=100 --model_name=esmm --dataset_path="ctr_data_100000.csv" --train_batch_size=4096 --val_batch_size=4096 --test_batch_size=4096 --epochs=20 --lr=0.0001 --embedding_size=32 --mtl_task_num=2
The best click/like AUC I obtained is only "0.605, 0.710":
Epoch 18 train loss is 0.736, click auc is 0.605 and like auc is 0.710

This is far below the ESMM results in the paper, 0.7940 and 0.9110.

Could you tell me what might need to be adjusted?
The data I generated is attached.

ctr_data_100000.zip

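Can the dataset be made public after processing?

Thank you very much for releasing such a high-quality dataset! I have one question: if I use this dataset in a paper, would providing a download of the processed dataset in the paper's GitHub repository violate the usage agreement? Here, processing means remapping user ids and item ids and filtering items.

Thanks again, and I look forward to your reply!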

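Scope of use for the dataset?

What is the permitted scope of use for this dataset?

From the terms, it is not entirely clear to me what the scope for downloading and using the dataset is. Can it be used for model training and testing in a paper?
Or can it not be used for published papers, only for personal study?

Thank you for your answer!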

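Could you provide a dataset link accessible in mainland China?

Demand in China is fairly large, isn't it? Hosting a downloadable link for a 5G dataset in China should not be much trouble. Could you put one up for now? It would not conflict with the website you are currently building.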

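Reproducing the paper's results

Hello, I directly used the ctr1M data from the downloaded dataset, sampled positives and negatives at a ratio of 1:2, and split the data 8:1:1. My own implementation of the xdeepfm model reaches an AUC of 0.899, far higher than the result in the paper. Were the reference results in the paper trained to their best performance?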
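
The setup described in this issue (1:2 positive-to-negative sampling followed by an 8:1:1 split) could be sketched as below. This illustrates the issue author's description only, not the paper's preprocessing. The `(row_id, label)` rows, the function name, and the seed are hypothetical.

```python
import random

def sample_and_split(rows, neg_per_pos=2, seed=0):
    """Keep all positives, sample `neg_per_pos` negatives per positive,
    shuffle, then split 8:1:1 into train/valid/test."""
    rng = random.Random(seed)
    pos = [r for r in rows if r[1] == 1]
    neg = [r for r in rows if r[1] == 0]
    k = min(len(neg), neg_per_pos * len(pos))
    data = pos + rng.sample(neg, k)
    rng.shuffle(data)
    n_train = len(data) * 8 // 10
    n_valid = len(data) // 10
    return (data[:n_train],
            data[n_train:n_train + n_valid],
            data[n_train + n_valid:])

# Hypothetical rows: 20 positives and 80 negatives
rows = [(i, 1 if i % 5 == 0 else 0) for i in range(100)]
train, valid, test = sample_and_split(rows)
print(len(train), len(valid), len(test))
```

Changing the negative sampling ratio changes the class balance of the evaluation data, so results obtained under different sampling and splitting schemes may not be directly comparable.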
