Comments (4)

Leavingseason commented on May 26, 2024

Sure. The data was provided by my colleague; I will ask him when he comes in tomorrow.
I don't think it is complicated: just a few operations, such as reading the original MovieLens data with Pandas and then writing it to a pkl file.

from openlearning4deeprecsys.

xray1111 commented on May 26, 2024

@Leavingseason Thanks! That would be a great help.

Leavingseason commented on May 26, 2024

`import pickle

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer


def get_100k_data():
    # The paths below point at the author's own share; adjust them to your copy of the data.
    # Load the ratings and store them as float32.
    df = pd.read_csv(r"\e$\Users\v-fuz\Dataset\FlatFile\Recommendation_Dataset\MovieLens\ml-latest-100k\ratings.csv",
                     sep=',', engine='python')
    df["rating"] = df["rating"].astype(np.float32)

    # Map raw user and movie IDs to contiguous 0-based indices.
    user_mapping = {x: i for i, x in enumerate(df["userId"].unique())}
    movie_mapping = {x: i for i, x in enumerate(df["movieId"].unique())}
    df["userId"] = df["userId"].map(user_mapping)
    df["movieId"] = df["movieId"].map(movie_mapping)

    # Movie content: binary bag-of-words over each movie's genres.
    movies = pd.read_csv(r"\e$\Users\v-fuz\Dataset\FlatFile\Recommendation_Dataset\MovieLens\ml-latest-100k\movies.csv",
                         sep=',', engine='python')
    movies["movieId"] = movies["movieId"].map(movie_mapping)
    # Movies without any rating map to NaN; drop them so the index stays integer.
    movies = movies.dropna(subset=["movieId"]).astype({"movieId": int})
    movies = movies.set_index('movieId')
    movies["genres"] = movies["genres"].map(lambda x: x.replace('|', ' ').lower())
    movie_content = []
    index_set = set(movies.index)
    for i in range(len(movie_mapping)):
        if i in index_set:
            movie_content.append(movies.loc[[i]].iloc[0]["genres"])
        else:
            movie_content.append('')
    vectorizer = CountVectorizer(binary=True)
    movie_content = vectorizer.fit_transform(movie_content)
    movie_content = movie_content.astype(np.float32)

    # User content: binary bag-of-words over all tags a user has assigned.
    users = pd.read_csv(r"\\e$\Users\v-fuz\Dataset\FlatFile\Recommendation_Dataset\MovieLens\ml-latest-100k\tags.csv",
                        sep=',', engine='python')
    users["userId"] = users["userId"].map(user_mapping)
    # Users who tagged but never rated map to NaN; drop them as well.
    users = users.dropna(subset=["userId"]).astype({"userId": int})
    users = users.set_index('userId')
    user_content = []
    index_set = set(users.index)
    for i in range(len(user_mapping)):
        if i in index_set:
            # astype(str) guards against non-string tag values.
            user_content.append(' '.join(users.loc[[i]]["tag"].astype(str)))
        else:
            user_content.append('')
    # Fit a fresh vectorizer so the tag vocabulary is separate from the genre vocabulary.
    vectorizer = CountVectorizer(binary=True)
    user_content = vectorizer.fit_transform(user_content)
    user_content = user_content.astype(np.float32)

    # Shuffle the ratings and split them 90% / 10% into train and test sets.
    df = df.rename(columns={"userId": "user", "movieId": "item", "rating": "rate"})
    rows = len(df)
    df = df.iloc[np.random.permutation(rows)].reset_index(drop=True)
    split_index = int(rows * 0.9)
    df_train = df.iloc[:split_index]
    df_test = df.iloc[split_index:].reset_index(drop=True)

    with open('movielens_100k.pkl', 'wb') as outfile:
        pickle.dump((df_train, df_test, user_content, movie_content), outfile, pickle.HIGHEST_PROTOCOL)


if __name__ == '__main__':
    get_100k_data()
    print("Done!")`
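For reading the file back, a minimal sketch might look like the following. The helper name `load_movielens_pickle` is hypothetical and not part of the repository; it only assumes the 4-tuple layout `(df_train, df_test, user_content, movie_content)` that the script above dumps.

```python
import pickle


def load_movielens_pickle(path):
    """Load the 4-tuple written by the script above (hypothetical helper)."""
    with open(path, "rb") as infile:
        payload = pickle.load(infile)
    # Assumed layout: (df_train, df_test, user_content, movie_content).
    if not isinstance(payload, tuple) or len(payload) != 4:
        raise ValueError("expected a 4-tuple in the pickle file")
    df_train, df_test, user_content, movie_content = payload
    return df_train, df_test, user_content, movie_content
```

Calling `load_movielens_pickle('movielens_100k.pkl')` should return the two rating DataFrames and the two sparse content matrices. Since the pickled objects are pandas and SciPy types, both libraries need to be installed in the environment that loads the file.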

xray1111 commented on May 26, 2024

Wow! Thanks a lot! @Leavingseason
