
MIND - Deep Learning Recommendation System




Authors:


Submission:


Tel-Aviv University

Requirements

  • Python 3.7.6
  • PyTorch 1.10
  • Anaconda V4.11.0
  • MIND-Dataset (small or full)

Installation

Clone the repository, and run the following command (in the root directory):

$ conda create --name <env> --file requirements.txt

The data should be downloaded and placed in the project directory: create a data directory with train and test subdirectories, and place the data from the download link accordingly.
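For example, assuming the layout described above (the directory names are our reading of these instructions, not verified against the code):

$ mkdir -p data/train data/test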

Usage

After installation, you can run the following command to activate the environment:

$ conda activate <env>

Run the following command:

$ python main.py

For a full list of commands, run:

$ python main.py --help

After running the code, a prompt will appear asking for the mode; choose between:

  • 'tfidf' - TF-IDF mode
  • 'model' - Model mode

Additional Files and Directories

  • notebooks - Jupyter Notebooks - for visual aids creation and EDA
  • docs - PPT containing a short presentation
  • README.md - The current document, serving as the final report for this project

Abstract

This project is the final submission for the course "Introduction to Search, Information Retrieval and Recommender Systems". The dataset used is the MIND dataset by Microsoft, which contains articles from the Microsoft News and Microsoft Blog websites, along with the behavior of over 2.5 million users. The dataset is divided into train, validation and test sets; we discuss this division in the following sections.

Furthermore, a "SMALL-MIND" dataset was supplied as well; due to time and computational constraints, we chose to use it as our main set for training and testing.

Our main challenge was figuring out how to represent a large amount of text data in numerical form. We used TF-IDF due to its simplicity and efficiency, applying two main approaches for recommendation:

  • A pure TF-IDF approach - using cosine similarity
  • A hybrid approach - using TF-IDF, content-based filtering and collaborative filtering

In both approaches, we defined the baseline for the recommendations as the most popular articles by clicks. We implemented different metrics for evaluating the quality of our recommendations, primarily the nDCG score, in order to have results comparable with existing MIND projects.

Coding Standards

The code is written in Python 3.7.6, and is divided into the following modules:

  • main.py - The main file, which contains the main function.
  • mind.py - Contains the MIND dataset classes.
  • models/net_models.py - Contains the neural network models.
  • models/utils - Utility modules: tensorize.py, load_preprocess.py, evaluate.py.

We have made a significant effort to adhere to the following:

  • PEP8
  • OOP principles
  • Documentation
  • Simplicity
  • Minimal code duplication using inheritance and composition
  • Version control using Git

Define the Problem

The problem we are trying to solve is recommending articles to users based on their behavior and the articles they have read and interacted with. This seemingly simple problem is actually a huge challenge in the field of recommendation. We think this dataset is especially interesting because it gives us the opportunity to explore both collaborative and content-based filtering approaches.

Additional challenges include:

  • How to create the features?
  • How much of the data is relevant?
  • How to evaluate the results?

Data

News Articles

First, let's see some basic exploration of the data. The article categories are shown below:

*(Figure: article category distribution)*

The data itself (referring, as mentioned, to SMALL-MIND) contains 51,282 unique articles, including their categories, subcategories, abstracts, and content. For a full review of the data's structure, please refer to the README in the Microsoft News repository.

Content and Text

When applying TF-IDF, we must first "clean" the input data. We implemented methods for easily choosing the desired columns from the dataset for preprocessing; the selected columns then go through the following steps:

  • Remove all the punctuation
  • Remove all the stopwords
  • Remove all the numbers
  • Stem the words
  • Lemmatize the words

In all of the above, the definitions of "stop word" and "punctuation" come from the nltk library.
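Below is a minimal sketch of such a cleaning pipeline, assuming nltk's standard stopword list, PorterStemmer and WordNetLemmatizer; the function and variable names are illustrative, not the project's actual code.

```python
import string

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Requires: nltk.download("punkt"), nltk.download("stopwords"), nltk.download("wordnet")
STOPWORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def clean_text(text: str) -> str:
    # Remove punctuation and numbers in one pass
    text = text.translate(str.maketrans("", "", string.punctuation + string.digits))
    # Tokenize, drop stopwords, then stem and lemmatize each token
    tokens = [t for t in word_tokenize(text.lower()) if t not in STOPWORDS]
    tokens = [lemmatizer.lemmatize(stemmer.stem(t)) for t in tokens]
    return " ".join(tokens)
```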

Behaviors

The data contains the behavior of users (over 2.5 million users) and the articles they have read and interacted with. SMALL-MIND contains 50,000 unique users who interacted only with articles that appear in the SMALL-MIND news dataset.

The data is built in a way that presents users and their respective "sessions" of reading and interacting with articles. We chose to represent every user as his complete history combined, while a method for transforming the data into "sessions" mode is implemented inside the main Mind class.

Data Train-Test Split

The data is split by Microsoft News into train and test sets. The split was done based on collecting data for 4 weeks for training and 4 weeks for testing.

Undersampling

We noticed early on that most users simply don't interact with articles. Therefore, we decided to undersample the users' behaviors (a sketch follows the figure below). The distribution of clicks in the dataset:

*(Figure: distribution of clicks in the dataset)*
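A minimal sketch of such undersampling with pandas; the DataFrame layout and the "clicked" column name are illustrative assumptions, not the project's actual code.

```python
import pandas as pd

def undersample(behaviors: pd.DataFrame, ratio: float = 1.0, seed: int = 42) -> pd.DataFrame:
    clicks = behaviors[behaviors["clicked"] == 1]
    non_clicks = behaviors[behaviors["clicked"] == 0]
    # Keep `ratio` non-click rows per click row
    n = min(len(non_clicks), int(len(clicks) * ratio))
    sampled = non_clicks.sample(n=n, random_state=seed)
    # Shuffle the combined result
    return pd.concat([clicks, sampled]).sample(frac=1, random_state=seed)
```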

Determining the Baseline

A major question arises:

How do we validate the quality of our recommendations, especially with the TF-IDF approach?

We chose to use the nDCG score[^1], while the baseline is the most popular articles by clicks.

Pure TF-IDF

For user representation, we take all the news items that appear in a user's history, convert them into vectors, and calculate their mean vector (a csr-matrix)[^2]. After that, we convert all the "impressions" news items to csr-matrices, join them with the user's mean vector, and perform a cosine similarity computation to estimate the distance of each impression item from the user's mean vector. Finally, we sort all the distances in descending order.

```mermaid
graph LR
A[User] -- History --> B(Mean History Vector)
A -- Impressions --> C(Impressions Vectors)
B --> D{Cosine Similarity}
C --> D
D --> h(TF-IDF Score)
h --> k(User Labels)
k --> j(TF-IDF nDCG Score)
G(User Click Score) --> k
k --> i(User Click nDCG Score)
j <--compare--> i
```
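A minimal, self-contained sketch of this scoring with scikit-learn; the toy corpus and the variable names are illustrative, not the project's actual code.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for the cleaned article titles
cleaned_titles = [
    "stock market rally", "local team wins final",
    "new phone released", "market drops on news",
]
vectorizer = TfidfVectorizer()
article_vecs = vectorizer.fit_transform(cleaned_titles)  # csr_matrix, one row per article

def rank_impressions(history_idx, impression_idx):
    # Mean of the user's history vectors (the user representation)
    user_vec = np.asarray(article_vecs[history_idx].mean(axis=0))
    # Cosine similarity of each impression to the mean history vector
    scores = cosine_similarity(article_vecs[impression_idx], user_vec).ravel()
    # Most similar impressions first (descending order)
    return [impression_idx[i] for i in np.argsort(-scores)]

print(rank_impressions(history_idx=[0, 3], impression_idx=[1, 2]))
```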

Performance Metrics

In order to compute nDCG, we compared the TF-IDF top-k with the user's top-k by clicks, using the actual behavior of the user via the impression labels.
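For reference, a minimal sketch of nDCG@k over binary click labels ordered by the recommender's scores; this is the generic formulation, not necessarily the project's exact implementation.

```python
import numpy as np

def dcg_at_k(relevances, k):
    relevances = np.asarray(relevances, dtype=float)[:k]
    # Log2 discount: position 1 gets weight 1/log2(2)=1, position 2 gets 1/log2(3), ...
    discounts = np.log2(np.arange(2, relevances.size + 2))
    return np.sum(relevances / discounts)

def ndcg_at_k(relevances, k):
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Example: clicks at ranks 1 and 3 out of 5 shown items
print(ndcg_at_k([1, 0, 1, 0, 0], k=5))  # ~0.92
```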

Results - Using TF-IDF on 'title' Only

The results are presented below:

| Metric  | Baseline | TF-IDF   |
| ------- | -------- | -------- |
| nDCG    | 0.390468 | 0.421797 |
| nDCG@5  | 0.146443 | 0.186489 |
| nDCG@10 | 0.186622 | 0.221536 |

*(Figure: nDCG scores, TF-IDF vs. baseline)*

Overall, the performance of TF-IDF is better than the baseline. We also tried including other columns in the TF-IDF, for example using TF-IDF on 'title' and 'category':

*(Figure: nDCG results using 'title' and 'category')*

As seen above, the category does not contribute much to the TF-IDF score. We also tried adding the abstract:

*(Figure: nDCG results including the abstract)*

Adding the abstract does not improve the results; it actually decreases performance.


Deep Learning Approach

In this project, we used the PyTorch library to build a neural network model. Approaching this challenge, we noticed that the main difficulty is the representation of users and articles as features. This is a complex matter that we spent a lot of time understanding, researching and implementing. If not adjusted, the TF-IDF matrix can have 280,000 features or more, so we first had to limit the TF-IDF matrix using the min_df and max_df parameters.
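A minimal sketch of limiting the vocabulary this way; the exact thresholds below are illustrative assumptions, not the values we used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    min_df=5,             # drop terms appearing in fewer than 5 documents
    max_df=0.5,           # drop terms appearing in more than half the documents
    max_features=10_000,  # hard cap on the vocabulary size
)
```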

Next, we came up with two strategies for building the features for the neural network (a sketch follows the diagram below):

  1. Tensor multiplication of the mean history vector and the impression vector - dimensions = max_features
  2. Tensor dot product of the mean history vector and the impression vector - dimensions = 1

```mermaid
flowchart LR
A[User] --> B(User Vectors)
C[Impressions] --> D(Impressions Vectors)
B --> E{Vector Multiplication}
D --> E
B --> e{Tensor Dot Product}
D --> e
E --n=max_features--> f(User Features)
e --n=1--> F(User Features)
```
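A minimal sketch of the two strategies, reading "tensor multiplication" as elementwise multiplication (which yields a max_features-dimensional vector, matching the diagram); the tensors here are random stand-ins.

```python
import torch

max_features = 10_000
user_vec = torch.rand(max_features)  # mean history vector (stand-in)
imp_vec = torch.rand(max_features)   # impression vector (stand-in)

# Strategy 1: elementwise multiplication -> feature vector of size max_features
features_mul = user_vec * imp_vec

# Strategy 2: dot product -> a single scalar feature
features_dot = torch.dot(user_vec, imp_vec).unsqueeze(0)

print(features_mul.shape, features_dot.shape)  # torch.Size([10000]) torch.Size([1])
```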

Model Architecture

We chose to implement a fully connected neural network with one hidden layer of 64 neurons, followed by a ReLU activation function and a linear output layer. The output is transformed into a probability distribution using the softmax function. This architecture is rather simple and was not derived from a complex approach; rather, it is the result of our reading and research.
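A minimal sketch of this architecture in PyTorch, assuming a binary click/no-click output; the class name and input size are illustrative assumptions.

```python
import torch.nn as nn

class ClickNet(nn.Module):
    def __init__(self, in_features: int, n_classes: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, 64),  # single hidden layer, 64 neurons
            nn.ReLU(),
            nn.Linear(64, n_classes),    # linear output layer
        )

    def forward(self, x):
        # Softmax turns the logits into a probability distribution
        return nn.functional.softmax(self.net(x), dim=-1)
```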

Performance Metrics

Currently, our model supports the standard metrics for evaluating the performance of classification models. We used torch.utils.data.DataLoader to load the data in batches, yet the dataset is still very large. Due to time constraints, we limited the number of epochs to 100 for the first approach and only 10 for the second. In the near future, we plan to implement nDCG metrics for the model, using the probabilities produced by the neural network.

Results

Tensor Multiplication

This approach resulted in almost no learning. The model was not capable of learning, as shown in the figure below:

*(Figure: training loss, tensor multiplication features)*

As shown, the loss is not converging. We believe this is because many users have no relation at all to this representation of the articles: it contains many zero values, which can zero out the whole feature vector.

*(Figure: training metrics, tensor multiplication features)*

Tensor Dot Product

This approach was our latest, and we have not yet managed to make it much better, although the loss appears to be converging during training. This leads us to believe that with prolonged training, the model will be able to learn better.

*(Figure: training loss, tensor dot product features)*

This does not seem much better, yet it was trained for only 10 epochs.

*(Figure: training metrics, tensor dot product features)*

Conclusions and Next Steps

It is very hard to evaluate the performance of the model, since the learning was not converging; therefore, we don't have this type of comparison between the two approaches, and currently TF-IDF is clearly better. The results are a mixed bag. On one hand, the results of the pure TF-IDF are decent. On the other hand, the results of the model are very poor. We draw the following conclusions:

  • Representing users' history vectors via TF-IDF might not suit the type of deep learning we applied.
  • A more advanced user embedding might work better, and should be evaluated in comparison with the TF-IDF approach.
  • We should apply nDCG metrics to the model, to better understand its results versus pure TF-IDF.

Footnotes

[^1]: While currently implemented fully for TF-IDF, we will also be implementing it for the hybrid approach.
[^2]: We use the word "vector" loosely; practically speaking, these are csr-matrices.
