
pinsage's Introduction

PinSAGE

PinSAGE for wine recommendation. This is the PinSAGE package applied to a wine recommendation system, prepared for the 11th Tobigs Conference by team "투믈리에". It is implemented on top of the DGL library, starting from the official PinSAGE example and modified to fit this project.

PinSAGE paper: https://arxiv.org/pdf/1806.01973.pdf
DGL: https://docs.dgl.ai/#
DGL PinSAGE example: https://github.com/dmlc/dgl/tree/master/examples/pytorch/pinsage

Requirements

    • dgl
    • dask
    • pandas
    • torch
    • torchtext
    • sklearn

Dataset

Vivino

11,900,000 wines & 42,000,000 users
User features: userID, user_follower_count, user_rating_count
Item features: wine_id, body, acidity, alcohol, rating_average, grapes_id

We have received requests to share the data, so we provide part of it:

  • 100,000 review data
  • User Metadata
  • Wine Metadata

Since this is not the full dataset, training on it yourself may not reach the performance you expect. process_wine.py preprocesses the collected data into the format DGL expects; if you use the provided data, please refer to it.
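For reference, process_wine.py takes a base directory and an output path via argparse; assuming the provided JSON files are placed under a ./data directory (a hypothetical path), a run could look like:

python process_wine.py --directory ./data --output_path data.pkl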

Training model

Nearest-neighbor recommendation

This model recommends wines to every user via K-nearest neighbors: it takes the center of the embedding vectors of the wines a given user has consumed and recommends the K wines closest to that center vector.

python model.py -d data.pkl -s model -k 500 --eval-epochs 100 --save-epochs 100 --num-epochs 500 --device 0 --hidden-dims 128 --batch-size 64 --batches-per-epoch 512
  • d: data file (the .pkl produced by process_wine.py)
  • s: name under which the model is saved
  • k: top-K count
  • eval-epochs: epoch interval for printing performance (0 = no output)
  • save-epochs: epoch interval for saving the model (0 = no saving)
  • num-epochs: number of epochs
  • hidden-dims: embedding dimension
  • batch-size: batch size
  • batches-per-epoch: number of iterations per epoch

There are additional PinSAGE-specific parameters as well; please refer to the model.py code for those.

Inference

The code below explains how inference is done; it is excerpted from the train function of model.py. The evaluation method for this project differs from the original DGL PinSAGE example, which recommends only a single item.

Embeddings

Since the model's goal is to learn node embeddings, we first obtain the embeddings of all items and then separately perform similarity measurement or clustering through vector operations.

model.py line 159

h_item = evaluation.get_all_emb(gnn, g.ndata['id'][item_ntype], data_dict['testset'], item_ntype, neighbor_sampler, args.batch_size, device)

This obtains all item embeddings from the node information in the DGL graph object; the resulting shape is (number of items, embedding size).

model.py line 182~
h_center = torch.mean(h_nodes, axis=0) # central embedding
dist = h_center @ h_item.t() # center embedding * all embeddings -> matrix product
topk = dist.topk(args.k)[1].cpu().numpy() # indices of the top-K items by score

For inference we average the node embeddings of a particular user's wines to obtain a central embedding vector, then compute scores against all item embeddings with a matrix product. The K items with the highest scores are extracted and presented as the final recommendations.

Recall and Hitrate then measure whether the selected items belong to the validation data.
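The evaluation helpers themselves are not excerpted here; as a rough sketch, Hitrate and Recall for top-K recommendations are commonly computed per user as below. The topk_per_user mapping of userID to recommended wine IDs is a hypothetical structure used only for illustration; testset is the per-user dictionary of held-out wines built by process_wine.py.

import numpy as np

def hitrate_recall(topk_per_user, testset):
    # Hitrate: fraction of users whose top-K list contains at least one held-out wine.
    # Recall:  per-user fraction of held-out wines recovered in the top-K, averaged over users.
    hits, recalls = [], []
    for user, recs in topk_per_user.items():
        truth = set(testset.get(user, set()))
        if not truth:
            continue  # skip users with no validation items
        overlap = set(recs) & truth
        hits.append(1.0 if overlap else 0.0)
        recalls.append(len(overlap) / len(truth))
    return float(np.mean(hits)), float(np.mean(recalls))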

Performance

Model     Hitrate   Recall
SVD       0.854     0.476
PinSAGE   0.942     0.693

pinsage's People

Contributors

yoonjong12
pinsage's Issues

Vivino Data

Is there some place where I can find this Vivino data?

dataset

hi,
thanks for sharing the preprocess function.

Could you tell me what the purpose of the timestamp column in the dataset is?
In MovieLens there is a timestamp column, and it is used to predict users' most recent interactions.
Would appreciate your insight on this.

[Testing] Re-implementation

Hello, I'm reaching out because I found this nice repository.
I'd like to run this model in my own setting.
Is there a way to reproduce it with the dataset, or at least with MovieLens?

question about train/test split

Thank you for the great work!
I have a question about the train/test split. In the original DGL process_movielens.py, the whole graph g is split by train_test_split_by_time, and train_g, a subgraph of g, is used as the training data. In process_wine.py, however, the whole graph g is used as train_graph. Is there no worry about accessing test nodes at train time?

why add bias in pinsage scorer

pinsage/layers.py

Lines 174 to 177 in 2bc4155

def _add_bias(self, edges):
    bias_src = self.bias[edges.src[dgl.NID]]
    bias_dst = self.bias[edges.dst[dgl.NID]]
    return {'s': edges.data['s'] + bias_src + bias_dst}

In this code the bias acts as a learnable parameter, so the final loss does not depend only on the embedding vectors.
Can the embedding dot product (or embedding similarity) still be used as a measure of item similarity?

The wine data processing seems to need a few fixes

First of all, thank you for sharing such valuable data and code.

The JSON handling in the data-processing code process_wine.py appears to need a few fixes, so I'm passing them along.

process_wine.py
users = pd.DataFrame(user_json) -> users = pd.DataFrame(user_json["data"])
items = pd.DataFrame(item_json) -> items = pd.DataFrame(item_json["data"])
columns = ['wine_id', 'name', 'rating_average', 'body', 'acidity', 'alcohol', 'grapes_id'] -> columns = ['wine_id', 'name', 'rating_average', 'body', 'acidity_x', 'alcohol', 'grapes_id']
items['wine_feats'] = list(items[['rating_average', 'body', 'acidity' ,'alcohol']].values) -> items['wine_feats'] = list(items[['rating_average', 'body', 'acidity_x' ,'alcohol']].values)

I guessed that only the rows with like equal to 1 remained in the provided train/test data, so I removed the like-related column handling.

Request for fixes to process_wine.py

  1. Filename mismatch with the file provided on GitHub (wines.json -> wine.json)
  2. items['grapes_id'] = [i[0] for i in items['grapes_id']] -> i[1]: fix so that only the grapes id value is kept
  3. test = test[colunms] -> test = test[columns]: typo fix
  4. textual_feature = {'name': items['name'].values} -> duplicated code

These are all fairly minor things, but fixing them would make it more convenient for the next people who run the code :)

import os
import re
import json
import pickle
import argparse
from collections import defaultdict

from tqdm import tqdm
import pandas as pd
import numpy as np
import scipy.sparse as ssp
import dgl
import torch
import torchtext
from builder import PandasGraphBuilder
from data_utils import *

'''
Wine Data Preprocessing
* Code that preprocesses the wine data into the input format expected by the DGL framework
* It was written for our specific project, so please treat it as a reference
* There may be bugs; if you report them via Issues, we will address them as soon as we can

* Prepare the following data in the base directory
    * pass the base directory with --directory
    * user.json: user metadata
    * wine.json: wine metadata
    * train.json: train review data
    * test.json: test review data

* output_path should use a .pkl extension
'''
print('Processing Started!')

parser = argparse.ArgumentParser()
parser.add_argument('--directory', type=str)
parser.add_argument('--output_path', type=str)
args = parser.parse_args()
directory = args.directory
output_path = args.output_path

# User Data
with open(os.path.join(directory, 'user.json')) as f:
    user_json = json.load(f)
users = pd.DataFrame(user_json["data"])

columns = ['userID', 'user_follower_count', 'user_rating_count']
users = users[columns]
users = users.dropna(subset=['userID'])
users['user_feats'] = list(users[['user_follower_count', 'user_rating_count']].values)

# Wine Data
with open(os.path.join(directory, 'wine.json')) as f:
    item_json = json.load(f)
items = pd.DataFrame(item_json["data"])
items = items.dropna()

columns = ['wine_id', 'name', 'rating_average', 'body', 'acidity_x', 'alcohol', 'grapes_id']
items = items[columns]
items = items.dropna(subset=['wine_id', 'grapes_id'])

items['grapes_id'] = [i[1] for i in items['grapes_id']]
items['wine_feats'] = list(items[['rating_average', 'body', 'acidity_x' ,'alcohol']].values)


# Rating Data
# train/test provided as an 8:2 split
columns = ['userID', 'wine_id', 'rating_per_user']

with open(os.path.join(directory, 'train.json')) as f:
    train = json.load(f)
train = pd.DataFrame(train['data'])

'''
* Like
We debated whether recommending items a user did not rate highly would be a good choice.
If you want to train regardless of the rating, feel free to remove the like-related code.

We set the threshold for a "like" at a rating of 3.
Please adjust this to your own judgment.
'''
train['like'] = [1 if x >= 3 else 0 for x in train['rating_per_user']]
train = train[train['like'] == 1]
train = train[columns]

with open(os.path.join(directory, 'test.json')) as f:
    test = json.load(f)
test = pd.DataFrame(test['data'])

test['like'] = [1 if x >= 3 else 0 for x in test['rating_per_user']]
test = test[test['like'] == 1]
test = test[columns]

ratings = pd.concat([train, test], axis=0, ignore_index=True)
user_filter = [k for k, v in ratings['userID'].value_counts().items() if v > 1]
ratings = ratings[ratings['userID'].isin(user_filter)]

# Build Graph
# Use only ratings whose user and item exist in the user/item tables
user_intersect = set(ratings['userID'].values) & set(users['userID'].values)
item_intersect = set(ratings['wine_id'].values) & set(items['wine_id'].values)

new_users = users[users['userID'].isin(user_intersect)]
new_items = items[items['wine_id'].isin(item_intersect)]
new_ratings = ratings[ratings['userID'].isin(user_intersect) & ratings['wine_id'].isin(item_intersect)]
new_ratings = new_ratings.sort_values('userID')


label = []
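# Per-user 80/20 split: the first 80% of each user's ratings keep timestamp 0 (train),
# the remaining 20% get timestamp 1 and later form the held-out validation set (testset).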
for userID, df in new_ratings.groupby('userID'):
    idx = int(df.shape[0] * 0.8)
    timestamp = [0] * df.shape[0]
    timestamp = [x if i < idx else 1 for i, x in enumerate(timestamp)]
    label.extend(timestamp)
new_ratings['timestamp'] = label

# Build graph
graph_builder = PandasGraphBuilder()
graph_builder.add_entities(new_users, 'userID', 'user')
graph_builder.add_entities(new_items, 'wine_id', 'wine')
graph_builder.add_binary_relations(new_ratings, 'userID', 'wine_id', 'rated')
graph_builder.add_binary_relations(new_ratings, 'wine_id', 'userID', 'rated-by')
g = graph_builder.build()


# Assign features.
node_dict = { 
    'user': [new_users, ['userID', 'user_feats'], ['cat', 'int']],
    'wine': [new_items, ['wine_id', 'grapes_id', 'wine_feats'], ['cat', 'cat', 'int']]
}
edge_dict = { 
    'rated': [new_ratings, ['rating_per_user', 'timestamp']],
    'rated-by': [new_ratings, ['rating_per_user', 'timestamp']]
}

for key, (df, features ,dtypes) in node_dict.items():
    for value, dtype in zip(features, dtypes):
        # key = 'user' or 'wine'
        # value = 'user_feats', 'grapes_id', etc.
        if dtype == 'int':
            array = np.array([i for i in df[value].values])
            g.nodes[key].data[value] = torch.FloatTensor(array)
        elif dtype == 'cat':
            g.nodes[key].data[value] = torch.LongTensor(df[value].astype('category').cat.codes.values)

for key, (df, features) in edge_dict.items():
    for value in features:
        g.edges[key].data[value] = torch.LongTensor(df[value].values.astype(np.float32))

# Dictionaries mapping category codes back to the original IDs
user_cat = new_users['userID'].astype('category').cat.codes.values
item_cat = new_items['wine_id'].astype('category').cat.codes.values

user_cat_dict = {k: v for k, v in zip(user_cat, new_users['userID'].values)}
item_cat_dict = {k: v for k, v in zip(item_cat, new_items['wine_id'].values)}

# Label
val_dict = defaultdict(set)
for userID, df in new_ratings.groupby('userID'):
    val_dict[userID] = set(df[df['timestamp'] == 1]['wine_id'].values)
    
# Build title set
textual_feature = {'name': items['name'].values}

# Dump the graph and the datasets
dataset = {
    'train-graph': g,
    'user-data': new_users,
    'item-data': new_items, 
    'rating-data': new_ratings,
    'val-matrix': None,
    'test-matrix': torch.LongTensor([[0]]),
    'testset': val_dict, 
    'item-texts': textual_feature,
    'item-images': None,
    'user-type': 'user',
    'item-type': 'wine',
    'user-category': user_cat_dict,
    'item-category': item_cat_dict,
    'user-to-item-type': 'rated',
    'item-to-user-type': 'rated-by',
    'timestamp-edge-column': 'timestamp'}

with open(output_path, 'wb') as f:
    pickle.dump(dataset, f)

    
print('Processing Completed!')
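As a quick sanity check, the saved pickle can be loaded back to inspect its contents before training (a minimal sketch, assuming the output path data.pkl used above):

import pickle

with open('data.pkl', 'rb') as f:
    dataset = pickle.load(f)

print(dataset['train-graph'])    # DGL heterograph with 'user' and 'wine' node types
print(len(dataset['testset']))   # number of users with held-out validation wines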
