
2017's People

Contributors: dkohlsdorf, fabianabel, wdroz, yasdel, yupbank


2017's Issues

Online target users

Dear RecSys Challenge organizers,

I am interested in the selection of targets for offline evaluation:

  • Are you planning to restrict the set of recommendable users?
  • Do the qualified teams receive different sets of target users? If not, how will the evaluation handle possible interference between the teams' recommendations?
  • What is the expected size of the target users (per team)?

Furthermore, will we receive an additional data set generated after the end of the offline training period?

Thank you,
David

Target Users

Hi All!

There are 1,497,020 users in the users catalog. The number of target users is 74,840, which is 4.9992% of the total number of users.

How is the list of target users defined? Is it a 5% random sample of the total catalog?
If not, can we know something more specific about the sampling logic?

Thanks,
David

Duplicate Users in Online dataset

After recently pulling the dataset, we found duplicate users in the users.csv file: e.g., user 44 has two entries with different regions, and user 15781851 has entries with different discipline IDs.

The total amount of users with duplicate entries is roughly 20k.

How can we determine which profile to use?
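For reference, a quick way to spot such duplicates, as a sketch (assuming a tab-delimited users.csv with a user_id column; the delimiter and column name would need adjusting to the actual schema):

```python
import csv
import io
from collections import Counter

def duplicated_user_ids(csv_text, id_column="user_id", delimiter="\t"):
    """Return the IDs that appear in more than one row."""
    rows = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    counts = Counter(row[id_column] for row in rows)
    return sorted(uid for uid, n in counts.items() if n > 1)

# Toy data mimicking the report: user 44 appears with two different regions.
sample = "user_id\tregion\n44\t1\n44\t7\n15781851\t3\n"
print(duplicated_user_ids(sample))  # ['44']
```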

success() function not defined in evaluation

Hi,

In the itemSuccess() function, I believe the success(item, u) method is not defined, or maybe I am not understanding it correctly.
Does it mean that the user u positively interacted (click or bookmark) with the item?
Thanks.

Division by zero in recommendation_worker.py

Hello all,

In recommendation_worker.py line 24 we have :

if x.title_match() > 0

When this check is false, data[], the list declared in line 18, remains empty; therefore the second "if" in line 29 is also false, which means the num_evaluated variable initialized to zero in line 16 stays zero. So in line 59, where we have:

score = str(average_score / num_evaluated)

we will have division by zero.
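A minimal guard for this failure mode, as a sketch (not the organizers' code): divide only when at least one pair was actually evaluated.

```python
def safe_average(average_score, num_evaluated):
    """Avoid ZeroDivisionError when no pair passed the title-match filter."""
    if num_evaluated == 0:
        return 0.0  # or float("nan"), depending on how the caller reports scores
    return average_score / num_evaluated

print(safe_average(10.0, 4))  # 2.5
print(safe_average(0.0, 0))   # 0.0
```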

Thanks,
Himan

Will target users change?

Hi guys.

Looking at the data retrieved from /api/online/data/users, we noticed that the target users are the same for May 1st and 2nd. Can we assume that each team will get the same 50k target users during the whole online part of the challenge, or will they change over time?

Thanks.

Question about evaluation metric

I found out something that looks like an inconsistency between your description of the evaluation metric and your pseudocode:

  • in the pseudocode we can read
    if (users.filter(u => success(item, u) > 0).size > 1)
  • while in the description it says

itemSuccess(item, users) if at least one successful push recommendation was created for a given item then this counts 50 points for paid items and 25 for other items.

Which one should we consider correct?
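For reference, the two readings differ only in the threshold; here is a sketch (the names item_success, success_counts and threshold are hypothetical, not the organizers' API):

```python
def item_success(item_is_paid, success_counts, threshold):
    """Award 50 points for paid items and 25 otherwise, once the number of
    successfully pushed users reaches `threshold` (1 = the description's
    "at least one" reading, 2 = the pseudocode's `size > 1` reading)."""
    if sum(1 for s in success_counts if s > 0) >= threshold:
        return 50 if item_is_paid else 25
    return 0

# Exactly one successful push for a non-paid item:
print(item_success(False, [1, 0, 0], threshold=1))  # 25  ("at least one")
print(item_success(False, [1, 0, 0], threshold=2))  # 0   (pseudocode's `> 1`)
```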
Thanks in advance

Features details & Evaluation: Title, Tags, Job roles & definition for itemSuccess(...)

Hello,

I came across several questions after reading the challenge description.

  • Q1: Is there any direct relation between tags, title or job roles?

  • Q2: How detailed/domain specific are the existing tags? In e.g. https://www.xing.com/jobs/frankfurt-main-full-stack-developer-frankfurt-75-000-28390260

    What tags could one expect to be included? More like:

    1. mission, awesome, responsibility, Client, small team, insights;
    2. SAAS, Full Stack Developer, software development.
  • Q3: Is itemSuccess(...) considered during the offline phase? If so, what counts as a successful push recommendation?

Thank you,
André

Number of target users/items per day in online challenge

Hi.

In the online challenge, new target users and items are released every day. It is also mentioned in the API description that each user can only receive one recommendation.

My question in this context is: How many items and users will there be per day? Ballpark numbers or just the ratio between the two numbers would already be helpful.

An example for why this is important: If we get 100 users and 100 items each day, recommending every item to at least one user becomes a challenge in itself due to the "only one recommendation per user" restriction. So having a general idea about the ratio between users and items per day would be helpful.

Cheers

Maximum score per (user,item) pair

Hello,

I have a question regarding the logic behind the score calculation.
In the challenge website it says:
"At maximum, one can thus earn 52 points (= (1 + 5 + 20 - 0) * 2) per user-item pair."

Does this mean a (user, item) pair can have multiple interaction_type?
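For what it's worth, my reading of the quoted formula as a sketch (the weights 1/5/20/-10 and the premium-user factor of 2 are my assumptions from the challenge description; whether several interaction types can co-occur on one pair is exactly the question):

```python
def user_item_score(clicked, bookmarked_or_replied, applied, deleted, premium_user):
    """Reproduces the quoted maximum: (1 + 5 + 20 - 0) * 2 = 52."""
    score = 1 * clicked + 5 * bookmarked_or_replied + 20 * applied - 10 * deleted
    return score * (2 if premium_user else 1)

# The maximum is only reachable if one pair can carry several interaction types:
print(user_item_score(True, True, True, False, True))  # 52
```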

How many items can each user receive?

Hello there,

In the challenge description, a limit is set on the number of users that can be chosen for each job (i.e. 100 users per job), but it is not mentioned how many times each user is allowed to appear in the lists for different jobs. In other words, how many jobs can each user receive? Do you take this into account when you parse our solution files?

Thanks

Has the new interaction distribution already started? (Moreover, the data in interactions.csv are not only those from the previous day.)

Hello,

We checked the interactions.csv that can be obtained via the API. We found that the distribution had already started and that we missed the interactions of April 24th.
Can we get these data?

We also found that the data in interactions.csv are not only those from the previous day.
Because of this, the file is huge and takes a long time to download.
Please fix it!

date impression click bookmark apply deny rec_check
20170404 1743 0 0 0 0 0
20170405 2094672 1671 129 303 36 0
20170406 1704179 139 12 7 24 0
20170407 1703492 0 0 0 0 0
20170408 524022 0 0 0 0 0
20170409 524369 0 0 0 0 0
20170410 1835483 0 0 0 0 0
20170411 2097214 0 0 0 0 0
20170412 1965931 0 0 0 0 0
20170413 1572698 0 0 0 0 0
20170414 653940 0 0 0 0 0
20170415 1255 0 0 0 0 0
20170416 261970 0 0 0 0 0
20170417 655422 0 0 0 0 0
20170418 2097656 0 0 0 0 0
20170419 2097493 0 0 0 0 0
20170420 2096749 0 0 0 0 0
20170421 1572958 0 0 0 0 0
20170422 523943 0 0 0 0 0
20170423 655217 0 0 0 0 0
20170424 2097558 0 0 0 0 0
20170425 2093177 2220 123 254 0 0
20170426 0 134 14 18 0 0

Thanks,
team chome

Hadoop logs are downloaded as part of the response when using the API

Hi,
I don't know whether we're supposed to be using the API for the online challenge yet, but I tried to download the data through it and noticed that both the items and interactions endpoints return what seem to be Hadoop logs as part of the response.
Here's an example of trailing text when polling the interactions endpoint, but similar errors also appear in the middle of the response.

Is this a problem on our end?
Thanks,
Daniele

difference of users/items between offline stage and online stage

Hi, I found that the newly provided items.csv and users.csv have many IDs overlapping with the offline stage's items.csv and users.csv; however, the profiles are totally different. I am confused about these files.
Does this mean the IDs were re-hashed for the online stage, and that we should not use the data released in the offline stage?

Timing of target interactions

Dear organizers,

I have a question regarding timing of target interactions.
After you collected the training data (interactions.csv), there was a period during which you collected the evaluation data (interactions with target items). Can we know how long that period is?
Can we assume that target interactions happen immediately after the last day of training interactions (interactions.csv), i.e. that there is no timeframe in between?

Thanks,
Stefan

Questions about the online challenge

Good morning,

Over the course of the past week I have been looking at the rules for the online part of the challenge. I have several questions to ask regarding these rules:

  1. How is the final leaderboard calculated? Will the best 2 daily submissions be considered or will the best 2 weeks be considered, with all the submissions in those 2 weeks?
  2. In the evaluation function for the offline challenge, part of the score is given by the item_success function, where 25 points are awarded per item. Will this item_success bonus be applied also during the online portion?
  3. Regarding day-by-day scoring, will we receive any feedback from our submissions? If so, how frequently will our scores be updated? Also, will there be a leaderboard like in the offline portion?
  4. Will different teams receive the same set or subset of target users and items? If so, can this happen on the same day or only on different days?

Thanks

Clicks for item/user combinations that have no impressions

Hi guys.

When looking at the daily interactions, we noticed that there are some clicks by users for certain items where no impression has been recorded previously. Meaning, the interaction data suggests that some items are clicked by users before an impression has been made. The ratio of item/user combinations for which this is the case is not negligible (about 30% in our case). At this moment, this makes it very problematic to analyze the data with performance indicators, since the calculation of, for example, the conversion rate depends on the accuracy of the impression count.

Is this an error or can this be explained by some user behavior?
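One way to measure this, as a sketch (the interaction-type encoding 0 = impression, 1 = click is an assumption here; adjust to the real schema):

```python
def clicks_without_impression(interactions):
    """interactions: iterable of (user_id, item_id, interaction_type, timestamp),
    where type 0 = impression and type 1 = click (assumed encoding).
    Returns (user, item) pairs whose first click precedes every impression."""
    first_impression = {}
    first_click = {}
    for user, item, itype, ts in interactions:
        key = (user, item)
        if itype == 0:
            first_impression[key] = min(ts, first_impression.get(key, ts))
        elif itype == 1:
            first_click[key] = min(ts, first_click.get(key, ts))
    return sorted(k for k, ts in first_click.items()
                  if k not in first_impression or first_impression[k] > ts)

data = [("u1", "i1", 0, 10), ("u1", "i1", 1, 20),  # normal: impression, then click
        ("u2", "i2", 1, 15)]                        # click with no impression at all
print(clicks_without_impression(data))  # [('u2', 'i2')]
```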

Cheers

file recommendation_worker: minor point

Hi,

Concerning the code below (extract from recommendation worker):

            # write the results to file

            if len(user_ids) > 0:            
                item_id = str(i) + "\t"
                fp.write(item_id)
                for j in range(0, len(user_ids)):
                    user_id = str(user_ids[j][0]) + ","
                    fp.write(user_id)
                user_id = str(user_ids[-1][0]) + "\n"
                fp.write(user_id)                
                fp.flush()

At the end, you write the same user twice, don't you?
Indeed, user_ids[-1][0] and user_ids[len(user_ids)-1][0] are the same, and with "j in range(0, len(user_ids))" the last index is j = len(user_ids)-1.
So the loop should be for j in range(0, len(user_ids)-1)?

I quickly checked my output file (produced by your program): the same user does appear twice at the end.
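A sketch of the fix using a join, which sidesteps the off-by-one entirely (assuming, as in the baseline, that user_ids is a list of (user_id, score) tuples):

```python
import io

def write_recommendations(fp, item_id, user_ids):
    """Write one line per item: item_id, a tab, then comma-separated user IDs.
    Unlike the original loop, each user is written exactly once."""
    if len(user_ids) > 0:
        line = str(item_id) + "\t" + ",".join(str(u[0]) for u in user_ids) + "\n"
        fp.write(line)
        fp.flush()

buf = io.StringIO()
write_recommendations(buf, 7, [(11, 0.9), (22, 0.8), (33, 0.7)])
print(repr(buf.getvalue()))  # '7\t11,22,33\n'
```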

BR.

Evaluation-metric

Hello,

I would have a question regarding the evaluation metric. From a quick look at the data I gather users sometimes delete a job after having clicked on it (or even bookmarked it). Am I right in presuming we don't incur a 10 point penalty for them?

Thank you very much!

Best wishes,
Alex

Baseline has some problems

Hello,

The baseline code has some problems.
1)
The main one is in the recommendation_worker.py file: in line 36, the variable num_evaluated is not defined. I assume you mean num_evaluation.
2)
Another problem: targetUsers.csv contains a header "user_id", so in the xgb.py file, line 69, an error occurs, since int("user_id") is not a valid integer.
3)
Also in recommendation_worker.py, line 36, len(average_score) is not defined, since average_score is a float and has no length.

Could you please check the code and run it to see whether it generates results correctly?
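A possible workaround for problem 2, as a sketch: tolerate the header line when reading targetUsers.csv (this assumes one non-negative user ID per line).

```python
import os
import tempfile

def read_target_users(path):
    """Read target user IDs, skipping blank lines and a possible 'user_id' header."""
    target_users = []
    with open(path) as fp:
        for line in fp:
            line = line.strip()
            if not line.isdigit():  # skips '' and the 'user_id' header
                continue
            target_users.append(int(line))
    return target_users

# Demo with a temporary file that reproduces the problematic header:
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write("user_id\n123\n456\n")
print(read_target_users(f.name))  # [123, 456]
os.remove(f.name)
```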

Thank you very much,
Himan

targetItems.csv and targetUsers.csv

Hello there,
So, we have the interactions.csv file which is our training set and the test set is the one that you evaluate our submissions based on that.

My question is about two files: targetItems.csv and targetUsers.csv. Are the items in targetItems.csv the same as the items in the test set? I mean does this file have the exact same items or maybe it has some more or maybe fewer items than the items in the test set. I have the same question about targetUsers.csv.

Thank you very much for your cooperation,
Himan

Doubt about the meaning of anonymized ID of zero

Hello, I found that zero appears among the anonymized IDs of attributes such as discipline_id, industry_id and region. There is no definite description of it. Does zero mean unknown/missing, or does it have a specific meaning just like the other IDs? This affects data preprocessing.
Could you add a definite description of it?

Thank you very much,
iFei

baseline details about file "xgb.py"

Hi,

In this file on GitHub, line 57 reads "param['nthread'] = 4" while line 59 reads "param['nthread'] = 5".
Is that OK?

Another question: the comment says the max depth for the tree is 6, but line 56 reads "'bst:max_depth': 2"?

Best regards.

There are users who are in interactions.csv but not in users.csv.

Hello,
We checked the files for the online evaluation and found that there are many users who appear only in interactions.csv (i.e. who are not in users.csv).
n(users) in interactions.csv: 691736
n(users) in both interactions.csv and users.csv (online): 340885
n(users) in both interactions.csv and users.csv (online and offline): 365160

Should we ignore them, or will you provide an additional contents file?
In the online evaluation, are the target users only those in the online users.csv?

Thanks,
team chome

Program stops working after training step

Hi,

I am facing an issue when running the baseline on a server. It is able to print up to:

....
....
[23] train-rmse:0.348896
[24] train-rmse:0.348896

and then the program stays idle. I believe what's happening is that the processes that are spawned through the 'multiprocessing' library in Python crash immediately after being initiated. Does anybody know how to resolve this?

I do not have this problem when running it on my laptop.

Thanks.

Why is there a 6 weeks period with no interactions in the dataset?

We found this "hole" in the interactions dataset a few days ago, but we couldn't understand if it was intentional or a mistake.
[figure: scatter plot of created_at (x axis) vs. item_id (y axis)]
In the plot above we have item IDs on the y axis and time on the x axis.
A dot on the plot is an interaction (of type 1, 2, 3 or 4) with item y at time t.

Something similar happens with the users:
[figure: scatter plot of created_at vs. user_id]
And in general (if we only plot interactions at a given time):
[figure: scatter plot of created_at vs. created_at]
Can you give us some more details about this?

Thanks,
Daniele

When will the submitted items be pushed to real users?

Hi, I am wondering at what time the system will push our submitted file to the end users.
Is it exactly the time when we upload the file? Since I am in the UTC+8 time zone, the time I upload my submission file may be midnight in your country, so the push surely cannot achieve good results.

Problem in running baseline code

Hello

When I run the baseline code (xgb.py), I get the following error:

[10:26:08] Tree method is automatically selected to be 'approx' for faster speed. to use old behavior(exact greedy algorithm on single machine), set tree_method to 'exact'
[0]	train-rmse:0.692811
[1]	train-rmse:0.54524
[2]	train-rmse:0.455807
[3]	train-rmse:0.404835
[4]	train-rmse:0.377351
[5]	train-rmse:0.363124
[6]	train-rmse:0.355944
[7]	train-rmse:0.352372
[8]	train-rmse:0.350607
[9]	train-rmse:0.349739
[10]	train-rmse:0.349311
[11]	train-rmse:0.349101
[12]	train-rmse:0.348998
[13]	train-rmse:0.348947
[14]	train-rmse:0.348922
[15]	train-rmse:0.348909
[16]	train-rmse:0.348903
[17]	train-rmse:0.3489
[18]	train-rmse:0.348899
[19]	train-rmse:0.348898
[20]	train-rmse:0.348897
[21]	train-rmse:0.348897
[22]	train-rmse:0.348896
[23]	train-rmse:0.348896
[24]	train-rmse:0.348896
Traceback (most recent call last):
  File "xgb.py", line 71, in <module>
    target_users += [int(line.strip())]
ValueError: invalid literal for int() with base 10: 'user_id'

dataset CRC checksum error

We have tried several times to download the dataset file you provide, but the interactions.csv file is broken.
Alternatively, is there another way to get the dataset?
I'm using Chrome on Windows.
Is anyone else having the same problem?

Differences in dataset schemas

Hi! The following differences in dataset schemas are a tiny source of frustration in automated scripts ...

1) Offline users.csv has 14 columns with "wtcj" in column 13. Online users.csv has 13 columns and lacks "wtcj".

2) The order of columns in items.csv differs across all three files (offline items.csv, online items.csv (history), and online items.csv (last incremental version as of 29.04)).

3) The order of columns in interactions.csv differs between the offline and online versions.

Can you please finalize the schemas as soon as possible?

Problem with online interactions

Please note that the new interactions.csv file has a wrong format: it also contains columns with item information, probably due to a join between interactions.csv and items.csv.
Mattia

Dataset zip appears to be corrupted

Hi,

sorry to open an issue about 5 milliseconds after the competition started, but we tried to download the dataset zip on different PCs and OSs and we found that the file seems to be corrupted somehow.
The problem seems to be related to the interactions.csv file, which is cut off at about the 16-millionth line and which weighs 400 MB once extracted (1.4 GB compressed).

Is this a problem on our end or does anyone else have this problem?

Thanks,
Daniele

questions about the timeline

Hello, I have two questions.
(1) Which time zone do you use for counting each day?
Since new items expire after 24 hours, the time zone seems to be an extremely important piece of information.
(2) I don't quite understand what this means:
"the final leaderboard score is not calculated over all submissions but over the two best weeks"
Could you explain what "two best weeks" means?
Thanks!

Impressions in daily online interactions

Hi guys.

In the online interactions we receive every day, there are also some impressions (interaction_type = 0). However, compared to the number of recommendations we requested to give to users on the previous day, the interactions of the following day do not contain nearly enough impressions (orders of magnitude less). At some point I read that the recommendations would be given out as push notifications. If this is the case, why do they seemingly not reach their targets? Or is this just a problem related to the interactions data?

Thanks.

Duplicate lines in interactions.csv

Hello,

In the data, there seem to be a lot (about 10^8) of duplicate interactions [1], i.e. lines that have all fields equal (including the timestamp). Should we remove them, or treat them as legitimate interactions? In the latter case, what does a duplicate mean?

Thanks,
Tiphaine.

[1] Found with the command: tail -n +2 interactions.csv |sort -T. -S4g|uniq|wc -l (which yields the number of distinct lines in interactions.csv).
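The same count can be reproduced in Python, as a sketch (for the real ~10^8-line file the shell pipeline in [1] is likely more practical):

```python
from collections import Counter

def duplicate_line_count(lines):
    """Number of surplus copies: total lines minus distinct lines
    (the Python analogue of comparing wc -l with sort | uniq | wc -l)."""
    counts = Counter(lines)
    return sum(n - 1 for n in counts.values())

example_lines = ["u1,i1,1,100", "u1,i1,1,100", "u2,i3,2,200"]
print(duplicate_line_count(example_lines))  # 1
```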

submission api still buggy?

Hi. Although I sanity-checked the submission file, POST /api/online/submission said it accepted only about 1/5 of the recos. I split the file into 5 chunks and resubmitted; it accepted no new submissions. With GET /api/online/submission, I can see the initial submission without loss. So, might this be a bug? I am confused.

Access to the online api

Have other teams who submitted the description and the NDA been able to see the button on the team page for the online round API and token?
It is still not visible for us. @dkohlsdorf

Team Amethyst.

Mismatch on column name "is_payed"

Hello,
in your description of the dataset of items.csv you are mentioning the field is_paid but in
the dataset it is is_payed.

Maybe you can fix this little mismatch :)

Team membership on XING

I would like to participate in the challenge with another researcher. We have both created an account on Xing.de. She created a team, but neither she nor I can find a way for me to join the approved team. Does this mean that only one person per team is authorized to submit solutions on the platform?
(I don't think sharing her credentials for the whole Xing platform would be a good practice)

Question about "num_accepted_recos"

Hi, I have a question about num_accepted_recos.
I submitted the file which includes 7316 user-item pairs, then I got the response as below:

{"num_accepted_recos":1476}

However, the submission I retrieved via GET /api/online/submission includes all 7316 user-item pairs, and its content is the same as the file I submitted (and each user appears there only once).

What explains this difference? Do I misunderstand the meaning of num_accepted_recos (the number of accepted user-item pairs)?

Thanks,

team chome

Daily online interactions

Hi guys.

Regarding the interaction data we are supposed to retrieve daily via /api/online/data/interactions, we were wondering why the number of recorded interactions is so low compared to the training data we received earlier. It seems like there is a filter based on the target items of the previous day(s), but the same cannot be said for the users: we found some users in the interactions that are not contained in our target users. Does this mean that we also receive interactions for users that are target users of other teams? If so, can we also see what other teams are recommending via the impressions?

Thanks.
