
Singular Value Decomposition with Numpy and SciPy

Introduction

In this lab, we will build a basic low-rank matrix factorization engine for recommending movies using Numpy and SciPy. The purpose of this lesson is to develop a sound understanding of how the process works in detail before we implement it in PySpark. We will use a small movie recommendation dataset suitable for such experiments. The system will make meaningful, though not highly accurate, recommendations.

Objectives

  • Build a basic recommendation system using the MovieLens 1 million ratings dataset
  • Use SciPy and Numpy to build a recommendation system based on matrix factorization.

Dataset

For this lab, we will use a dataset of 1 million movie ratings from the MovieLens project, collected by GroupLens Research at the University of Minnesota. The website offers many versions and subsets of the complete MovieLens dataset. We have downloaded the 1 million ratings subset for you; you can find it in the folder ml-1m. Visit this link and the MovieLens site above to get an understanding of the format of the included files before moving on:

  • ratings.dat
  • users.dat
  • movies.dat

Let's first read our dataset into pandas DataFrames before moving on.

Our datasets are in .dat format, with features separated by the delimiter '::'. Perform the following tasks:

  • Read the files ratings.dat, movies.dat and users.dat using Python's open(). Use encoding='latin-1' for these files

  • Split the above files on the delimiter '::' and create arrays for users, movies and ratings

  • Create ratings and movies dataframes from arrays above with columns:

    • ratings = ['UID', 'MID', 'Rating', 'Time']
    • movies = ['MID', 'Title', 'Genre']
  • View the contents of movies and ratings datasets

Note: Make sure to cast the appropriate columns to int (numeric) in these datasets.

# Code here
# Uncomment Below

# print(movies.head())
# print()
# print(ratings.head())
   MID                               Title                         Genre
0    1                    Toy Story (1995)   Animation|Children's|Comedy
1    2                      Jumanji (1995)  Adventure|Children's|Fantasy
2    3             Grumpier Old Men (1995)                Comedy|Romance
3    4            Waiting to Exhale (1995)                  Comedy|Drama
4    5  Father of the Bride Part II (1995)                        Comedy

   UID   MID  Rating       Time
0    1  1193       5  978300760
1    1   661       3  978302109
2    1   914       3  978301968
3    1  3408       4  978300275
4    1  2355       5  978824291
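
The parsing steps above could be sketched as follows. Since the real ml-1m files aren't reproduced here, the sketch parses a couple of sample lines in the same '::'-delimited format; the commented lines show how the same helper would be applied to the actual files.

```python
import io
import pandas as pd

def read_dat(f, names):
    """Parse a '::'-delimited MovieLens-style file object into a DataFrame."""
    rows = [line.strip().split('::') for line in f if line.strip()]
    return pd.DataFrame(rows, columns=names)

# Two sample lines mirroring the ratings.dat format (UID::MID::Rating::Time)
sample = io.StringIO("1::1193::5::978300760\n1::661::3::978302109\n")
ratings = read_dat(sample, ['UID', 'MID', 'Rating', 'Time']).astype(int)
print(ratings)

# With the real files, you would pass an open file handle instead, e.g.:
# with open('ml-1m/ratings.dat', encoding='latin-1') as f:
#     ratings = read_dat(f, ['UID', 'MID', 'Rating', 'Time']).astype(int)
```

Note the `.astype(int)` at the end: `split()` yields strings, so the numeric columns must be cast explicitly, as the note above warns.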

Creating the Utility Matrix

Matrix factorization, as we saw in the previous lesson, uses a "Utility Matrix" of users x movies. The cells at the intersections of users and movies hold the ratings users have given to movies. We saw how this is mostly a sparse matrix. Here is a quick refresher below:

Next, our job is to create such a matrix from the ratings table above in order to proceed forward with SVD.

  • Create a utility matrix A to contain one row per user and one column per movie. Pivot ratings dataframe to achieve this.
# Create a utility matrix A by pivoting the ratings dataframe

# Code here
MID 1 2 3 4 5 6 7 8 9 10 ... 3943 3944 3945 3946 3947 3948 3949 3950 3951 3952
UID
1 5.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 3706 columns
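
One way to build this pivot, sketched on a tiny stand-in ratings frame (the real one has about a million rows); `fillna(0)` turns the unrated cells into zeros as in the output above:

```python
import pandas as pd

# Tiny stand-in for the ratings DataFrame built earlier
ratings = pd.DataFrame({'UID': [1, 1, 2], 'MID': [10, 20, 10],
                        'Rating': [5, 3, 4], 'Time': [0, 0, 0]})

# One row per user, one column per movie; missing ratings become 0
A = ratings.pivot(index='UID', columns='MID', values='Rating').fillna(0)
print(A)
```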

Finally, let's perform mean normalization on our utility matrix, and save it as a numpy array for decomposition tasks.

# Mean normalize dataframe A to numpy array A_norm


# Code here 
array([[ 4.94009714, -0.05990286, -0.05990286, ..., -0.05990286,
        -0.05990286, -0.05990286],
       [-0.12924987, -0.12924987, -0.12924987, ..., -0.12924987,
        -0.12924987, -0.12924987],
       [-0.05369671, -0.05369671, -0.05369671, ..., -0.05369671,
        -0.05369671, -0.05369671],
       ...,
       [ 3.85429034, -0.14570966, -0.14570966, ..., -0.14570966,
        -0.14570966, -0.14570966],
       [ 4.89314625, -0.10685375, -0.10685375, ..., -0.10685375,
        -0.10685375, -0.10685375],
       [ 4.55477604,  4.55477604, -0.44522396, ..., -0.44522396,
        -0.44522396, -0.44522396]])
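
The normalization, sketched on a toy matrix: subtract each user's row mean (computed over all columns, zeros included, which matches the values in the output above) so every row of A_norm sums to zero.

```python
import numpy as np

# Toy utility matrix: 2 users x 3 movies (zeros mark unrated movies)
A = np.array([[5.0, 0.0, 3.0],
              [0.0, 4.0, 0.0]])

# Subtract each user's mean rating from their row
user_means = A.mean(axis=1)             # shape (n_users,)
A_norm = A - user_means.reshape(-1, 1)  # broadcast the means across columns
print(A_norm)
```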

Matrix Factorization with SVD

SVD helps us decompose the matrix $A$ into the best lower-rank approximation of the original matrix $A$. Mathematically, it decomposes $A$ into the product of two orthogonal matrices and a diagonal matrix:

$$\begin{equation} A = U\Sigma V^{T} \end{equation}$$

$A$ above is the users' ratings (utility) matrix, $U$ is the user "features" matrix, $\Sigma$ is the diagonal matrix of singular values (essentially weights), and $V^{T}$ is the movie "features" matrix. $U$ and $V^{T}$ are orthogonal and represent different things: $U$ expresses how much users "like" each latent feature, and $V^{T}$ expresses how relevant each feature is to each movie.

To get the lower rank approximation, we take these matrices and keep only the top $k$ features, which we think of as the underlying tastes and preferences vectors.

Perform the following tasks:

  • Import svds from SciPy
  • Decompose A_norm using 50 factors, i.e. pass k=50 to svds()
# Code here 
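
A sketch of the decomposition call, run here on a small random stand-in matrix so it is self-contained (on the real data you would pass A_norm with k=50). Note that svds returns the singular values in ascending order, smallest first, as visible in the output further below.

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
A_norm = rng.standard_normal((60, 40))  # stand-in for the real normalized matrix

# k must be smaller than min(A_norm.shape); the lab uses k=50 on the full data
U, sigma, Vt = svds(A_norm, k=5)
print(U.shape, sigma.shape, Vt.shape)
```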

Creating a diagonal matrix for the sigma factors

The sigma above is returned as a one-dimensional array of singular values. As we need to perform matrix multiplication in the next part, let's convert it to a diagonal matrix using np.diag(). Here is an explanation for this

  • Convert the sigma factors into a diagonal matrix with np.diag()
  • Check and confirm the shape of sigma before and after the conversion
# Code here
[ 147.18581225  147.62154312  148.58855276  150.03171353  151.79983807
  153.96248652  154.29956787  154.54519202  156.1600638   157.59909505
  158.55444246  159.49830789  161.17474208  161.91263179  164.2500819
  166.36342107  166.65755956  167.57534795  169.76284423  171.74044056
  176.69147709  179.09436104  181.81118789  184.17680849  186.29341046
  192.15335604  192.56979125  199.83346621  201.19198515  209.67692339
  212.55518526  215.46630906  221.6502159   231.38108343  239.08619469
  244.8772772   252.13622776  256.26466285  275.38648118  287.89180228
  315.0835415   335.08085421  345.17197178  362.26793969  415.93557804
  434.97695433  497.2191638   574.46932602  670.41536276 1544.10679346] (50,)





(array([[ 147.18581225,    0.        ,    0.        , ...,    0.        ,
            0.        ,    0.        ],
        [   0.        ,  147.62154312,    0.        , ...,    0.        ,
            0.        ,    0.        ],
        [   0.        ,    0.        ,  148.58855276, ...,    0.        ,
            0.        ,    0.        ],
        ...,
        [   0.        ,    0.        ,    0.        , ...,  574.46932602,
            0.        ,    0.        ],
        [   0.        ,    0.        ,    0.        , ...,    0.        ,
          670.41536276,    0.        ],
        [   0.        ,    0.        ,    0.        , ...,    0.        ,
            0.        , 1544.10679346]]), (50, 50))
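
The conversion itself is a one-liner; here it is on a short stand-in vector:

```python
import numpy as np

sigma = np.array([1.5, 2.0, 3.25])  # 1-D singular values, as returned by svds
sigma_diag = np.diag(sigma)         # promote to a square diagonal matrix
print(sigma.shape, sigma_diag.shape)
```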

Excellent! We converted sigma from a vector of length fifty into a 2D diagonal matrix of shape 50x50. We can now move on to making predictions from our decomposed matrices.

Making Predictions from the Decomposed Matrices

Now we have everything required to make movie rating predictions for every user. We will do it all at once by following the math: matrix-multiply $U$, $\Sigma$, and $V^{T}$ back together to get the rank $k=50$ approximation of $A$. Perform the following tasks:

  • Use np.dot() to multiply $U$, $\Sigma$ and $V^{T}$
  • Add the user rating means back to get actual star-rating predictions
# Code here 
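
Putting the whole pipeline together on a toy matrix (k=2 here instead of the lab's k=50, so the example stays small and self-contained):

```python
import numpy as np
from scipy.sparse.linalg import svds

# Toy utility matrix and its mean normalization, as in the earlier steps
A = np.array([[5.0, 0.0, 3.0, 0.0],
              [0.0, 4.0, 0.0, 2.0],
              [1.0, 0.0, 5.0, 0.0]])
user_means = A.mean(axis=1)
A_norm = A - user_means.reshape(-1, 1)

# Decompose, then rebuild the rank-k approximation
U, sigma, Vt = svds(A_norm, k=2)
predictions = np.dot(np.dot(U, np.diag(sigma)), Vt) + user_means.reshape(-1, 1)
print(predictions.round(2))
```

Adding `user_means` back at the end undoes the mean normalization, so the entries of `predictions` are on the original star-rating scale.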

For a practical system, the value of k would be identified by creating test and training datasets and selecting the value that performs best. We will leave this for the detailed experiment in the next lab.

Here, we'll see how to make recommendations based on the predictions array created above.

Making Recommendations

With the predictions matrix for every user, we can build a function to recommend movies for any user. We need to return the movies with the highest predicted rating that the specified user hasn't already rated. We will also merge in the user information to get a more complete picture of the recommendations. We will also return the list of movies the user has already rated, for the sake of comparison.

  • Create a DataFrame from the predictions and view its contents
  • Use the column names of A as the column names for this dataframe
# Code here 
MID 1 2 3 4 5 6 7 8 9 10 ... 3943 3944 3945 3946 3947 3948 3949 3950 3951 3952
0 4.288861 0.143055 -0.195080 -0.018843 0.012232 -0.176604 -0.074120 0.141358 -0.059553 -0.195950 ... 0.027807 0.001640 0.026395 -0.022024 -0.085415 0.403529 0.105579 0.031912 0.050450 0.088910
1 0.744716 0.169659 0.335418 0.000758 0.022475 1.353050 0.051426 0.071258 0.161601 1.567246 ... -0.056502 -0.013733 -0.010580 0.062576 -0.016248 0.155790 -0.418737 -0.101102 -0.054098 -0.140188
2 1.818824 0.456136 0.090978 -0.043037 -0.025694 -0.158617 -0.131778 0.098977 0.030551 0.735470 ... 0.040481 -0.005301 0.012832 0.029349 0.020866 0.121532 0.076205 0.012345 0.015148 -0.109956
3 0.408057 -0.072960 0.039642 0.089363 0.041950 0.237753 -0.049426 0.009467 0.045469 -0.111370 ... 0.008571 -0.005425 -0.008500 -0.003417 -0.083982 0.094512 0.057557 -0.026050 0.014841 -0.034224
4 1.574272 0.021239 -0.051300 0.246884 -0.032406 1.552281 -0.199630 -0.014920 -0.060498 0.450512 ... 0.110151 0.046010 0.006934 -0.015940 -0.050080 -0.052539 0.507189 0.033830 0.125706 0.199244

5 rows × 3706 columns
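
A minimal sketch of this step, with small stand-in values for A and the predictions array: reusing A's columns keeps the movie IDs attached to the predicted ratings.

```python
import numpy as np
import pandas as pd

# Stand-ins for the real utility matrix A and the predictions array
A = pd.DataFrame([[5.0, 0.0], [0.0, 4.0]], index=[1, 2], columns=[10, 20])
predictions = np.array([[4.9, 0.1], [0.2, 3.8]])

# Wrap the predictions with A's movie-ID columns
predictions_df = pd.DataFrame(predictions, columns=A.columns)
print(predictions_df)
```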

Now we have a predictions dataframe composed from the reduced factors, and we can generate predicted recommendations for every user. We will create a new function recommender() as shown below:

  • recommender()

    • Inputs: predictions dataframe, chosen UserID, movies dataframe, original ratings dataframe, num_recommendations
    • Outputs: Movies already rated by user, predicted ratings for remaining movies for user
  • Get the set of predictions for the selected user and sort them in descending order

  • Get the movies already rated by user and sort them in descending order by rating

  • Create a set of recommendations for movies not yet rated by the user

# Recommending top movies not yet rated by user
def recommender(predictions_df, UID, movies_df, original_ratings_df, num_recommendations=5):
    
    # Get and sort the user's predictions
    user_row = None # UID starts at 1, not 0
    sorted_predictions = None
    
    # Get the original user data and merge in the movie information 
    user_data = None 
    user_full = None

    
    # Recommend the highest predicted rating movies that the user hasn't seen yet.
    recommendations = None
                    
    # Print user information
    
    
    pass #return user_full, recommendations
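
One possible way to fill in that skeleton (a sketch, not the official solution; it assumes the column names and the MID-named columns of predictions_df built earlier in the lab):

```python
import pandas as pd

def recommender(predictions_df, UID, movies_df, original_ratings_df,
                num_recommendations=5):
    # Get and sort the user's predictions (UID starts at 1, rows at 0)
    user_row = UID - 1
    sorted_predictions = predictions_df.iloc[user_row].sort_values(ascending=False)

    # Movies the user has already rated, highest rating first
    user_data = original_ratings_df[original_ratings_df['UID'] == UID]
    user_full = (user_data.merge(movies_df, how='left', on='MID')
                          .sort_values('Rating', ascending=False))

    # Highest-predicted movies the user has not yet rated
    preds = sorted_predictions.rename('Prediction').reset_index()  # MID, Prediction
    unseen = movies_df[~movies_df['MID'].isin(user_full['MID'])]
    recommendations = (unseen.merge(preds, on='MID')
                             .sort_values('Prediction', ascending=False)
                             .head(num_recommendations))

    # Print user information
    print(f'User {UID} has already rated {user_full.shape[0]} movies.')
    print(f'Recommending highest {num_recommendations} predicted ratings '
          'movies not already rated.')
    return user_full, recommendations
```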

Using above function, we can now get a set of recommendations for any user.

# Get a list of already rated and recommended movies for a selected user
# Uncomment to run below


#rated, recommended = recommender(predictions_df, 100, movies, ratings, 10)
User 100 has already rated 76 movies.
Recommending highest 10 predicted ratings movies not already rated.
# Uncomment to run 

# rated.head(10)
UID MID Rating Time Title Genre
1 100 800 5 977593915 Lone Star (1996) Drama|Mystery
63 100 527 5 977594839 Schindler's List (1993) Drama|War
16 100 919 5 977594947 Wizard of Oz, The (1939) Adventure|Children's|Drama|Musical
17 100 924 4 977594873 2001: A Space Odyssey (1968) Drama|Mystery|Sci-Fi|Thriller
29 100 969 4 977594044 African Queen, The (1951) Action|Adventure|Romance|War
22 100 2406 4 977594142 Romancing the Stone (1984) Action|Adventure|Comedy|Romance
47 100 318 4 977594839 Shawshank Redemption, The (1994) Drama
20 100 858 4 977593950 Godfather, The (1972) Action|Crime|Drama
49 100 329 4 977594297 Star Trek: Generations (1994) Action|Adventure|Sci-Fi
50 100 260 4 977593595 Star Wars: Episode IV - A New Hope (1977) Action|Adventure|Fantasy|Sci-Fi
# Uncomment to run

# print ("\nTop Ten recommendations for selected user" )
# recommended
Top Ten recommendations for selected user
MID Title Genre
1311 1374 Star Trek: The Wrath of Khan (1982) Action|Adventure|Sci-Fi
1148 1193 One Flew Over the Cuckoo's Nest (1975) Drama
1312 1376 Star Trek IV: The Voyage Home (1986) Action|Adventure|Sci-Fi
285 296 Pulp Fiction (1994) Crime|Drama
570 590 Dances with Wolves (1990) Adventure|Drama|Western
1184 1240 Terminator, The (1984) Action|Sci-Fi|Thriller
877 912 Casablanca (1942) Drama|Romance|War
1161 1214 Alien (1979) Action|Horror|Sci-Fi|Thriller
1524 1617 L.A. Confidential (1997) Crime|Film-Noir|Mystery|Thriller
997 1036 Die Hard (1988) Action|Thriller

For the randomly selected user 100 above, we can subjectively judge that the recommender is doing a decent job: the recommended movies are similar in taste to the movies the user has already rated. Remember that this system is built using only a small subset of the complete MovieLens database, so there is significant room for improvement in predictive performance.

Level Up - Optional

  • Run the experiment again using validation testing to identify the optimal value of the rank k

  • Create test and train datasets, and evaluate the predicted ratings using a suitable metric (e.g. RMSE)

  • How much of an improvement do you see in the recommendations as a result of validation?

  • Ask other interesting questions

Additional Resources

Summary

In this lab, we learned that we can make good recommendations with collaborative filtering methods, using latent features from low-rank matrix factorization. This technique also scales significantly better to larger datasets. Next we will work with a larger MovieLens dataset and, using the MapReduce techniques seen in the previous section, implement a similar approach in PySpark with the Alternating Least Squares (ALS) matrix factorization method.
