muricoca / crab

Crab is a flexible, fast recommender engine for Python that integrates classic information filtering recommendation algorithms in the world of scientific Python packages (numpy, scipy, matplotlib).

Home Page: http://muricoca.github.com/crab

License: Other

Shell 1.35% Python 98.65%

crab's People

orygens douglas ionekr ga2arch dakerfp chenshouyuan joskid cloudappsetup heytong alexlin88 xuanhan863 talesp chrisgilmerproj elinaldosoft sys520084 pthinker ciphor ebottabi thiagocoroa ranjithtenz jeppe levonxxl panisson manmadewind perryhau nwf5d treper hezila lxlsosi cojito sungoak darjeeling eldraco webker litaoshao ramhemasri wnyc qqwjq fangang samuela ankitmustcode assad2008 intery89 hijbul manishiitj shelleyklop go4fun wy51r tomekla invinciblejha nvdnkpr wangeek brandonkane matheper rainyear maheedhargunturu setogit merijokul xl9211 ballacky13 singhman machinelearner sibghatullahsheikh henbow yangxu1222 mzhang001 jmizgajski ethanhu zatopek8848 gchen tc2680 sailingwood aeroflow adam23 hupili viktorija ahna stoneyang-ai qilinxo furongpeng fandres70 layhuang pwaila faisal-w japaks spencerx xuyong timedcy torasonic icaicai mrvege roant thinkgandhi miraculixx hhlsakura loisaidasam sxfmol cenwei binweiwu improper4

crab's Issues

Develop the Content-Based Filtering Recommender.

Develop the Content-Based Filtering Recommender based on this article:

http://irserver.ucd.ie/dspace/bitstream/10197/1893/1/sp145-phelan.pdf

Develop the DictBooleanPreferenceDataModel

Develop the data model that supports boolean preferences instead of storing a value for each preference for an item ID.

This model represents the presence or absence of the item ID; that is, the association can have one of two values: exists, or doesn't exist.

{ user_id: [ item_id, item_id2, ..] , user_id2: [item_id1, item_id2, item_id3, ...] }
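A minimal sketch of how such a structure might be held in memory, assuming a plain dict of sets. The class and method names (`BooleanPreferences`, `has_preference`, `items_from_user`) are hypothetical and not part of Crab's API:

```python
class BooleanPreferences(object):
    """Hypothetical in-memory store for boolean (exists / doesn't exist) preferences."""

    def __init__(self, dataset):
        # dataset: {user_id: [item_id, item_id2, ...], ...}
        # Sets give fast membership tests and drop duplicate item IDs.
        self._prefs = dict((user_id, set(items)) for user_id, items in dataset.items())

    def has_preference(self, user_id, item_id):
        # The association has only two states: present or absent.
        return item_id in self._prefs.get(user_id, set())

    def items_from_user(self, user_id):
        # Sorted so callers get a deterministic order.
        return sorted(self._prefs.get(user_id, set()))


model = BooleanPreferences({'u1': ['a', 'b'], 'u2': ['a', 'b', 'c']})
```

Unknown users simply have no items here; a real data model might prefer to raise an error instead.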

Investigate how to pass the rescorer function as a parameter to BaseRecommender's recommend method

It is necessary to pass a rescorer function to BaseRecommender's recommend method. How should it be passed?

Options discussed:

a) Decorators

b) Functions as parameters

c) ??
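For option (b), a rough sketch of what passing a plain function could look like. The standalone `recommend` function, the `(item_id, score)` rescorer signature and the "return None to filter out" convention are all assumptions made for illustration, not the agreed design:

```python
def recommend(scored_items, how_many, rescorer=None):
    """Return the top `how_many` (item_id, score) pairs.

    `rescorer` is an optional callable (item_id, score) -> new score;
    returning None drops the item from the results.
    """
    results = []
    for item_id, score in scored_items:
        if rescorer is not None:
            score = rescorer(item_id, score)
            if score is None:
                continue
        results.append((item_id, score))
    # Highest score first.
    results.sort(key=lambda pair: -pair[1])
    return results[:how_many]


def boost_new_items(item_id, score):
    # Hypothetical rescorer: filter out one item and double another's score.
    if item_id == 'old':
        return None
    return score * 2 if item_id == 'new' else score


top = recommend([('a', 0.9), ('new', 0.5), ('old', 0.8)], 2, rescorer=boost_new_items)
```

A decorator (option a) would bind the rescorer at definition time, while a parameter lets each call choose its own.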

Implement the Evaluation techniques

Implement the recommender evaluation techniques: MAE, RMSE, F1-Score, Precision and Recall.

References:

Evaluating collaborative filtering recommender systems

http://dl.acm.org/citation.cfm?id=963772

Evaluation of recommender systems: new approach

http://ec.iem.cyut.edu.tw/drupal/sites/default/files/Evaluation%20of%20recommender%20systems%20A%20new%20approach.pdf
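As a starting point, the five metrics named above can be sketched in plain Python. The function names and argument shapes (parallel lists of ratings, and lists of recommended / relevant item IDs) are assumptions for illustration:

```python
import math


def mae(predicted, actual):
    # Mean absolute error between predicted and actual ratings.
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / float(len(actual))


def rmse(predicted, actual):
    # Root mean squared error; penalizes large errors more than MAE.
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / float(len(actual)))


def precision_recall_f1(recommended, relevant):
    # Set-based precision, recall and their harmonic mean (F1).
    recommended, relevant = set(recommended), set(relevant)
    hits = len(recommended & relevant)
    precision = hits / float(len(recommended)) if recommended else 0.0
    recall = hits / float(len(relevant)) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1
```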

Add a new module for Similarities Metrics

Build a new module similar to cluster.metrics.pairwise that will hold all similarity metrics used in the recommenders.

Update the Wiki Page Developer Resources.

Update the wiki page Developer Resources to help new contributors to Crab.

Develop the User-Based Collaborative Filtering Recommender

Develop the User-Based Collaborative Filtering Recommender using k-nearest neighbors (KNN).

Put the references here:

Herlocker, J., Konstan, J., Borchers, A., & Riedl, J. An algorithmic framework for performing collaborative filtering. In Proceedings of international ACM SIGIR conference on research and development in information retrieval, Berkeley, California. New York: ACM.

Fix most_similar_user_ids to not fetch nan values in similarities.

To add:

def most_similar_users(self, user_id, how_many=None):
    '''
    Return the most similar users to the given user, ordered
    from most similar to least.

    Parameters
    -----------
    ...
    '''
    ...
    # Skip the user itself and any pair whose similarity is NaN.
    return np.array([to_user_id for to_user_id, pref in similarities
                     if user_id != to_user_id and not np.isnan(pref)])

Implement Cross Validation Techniques for Evaluating the Recommenders

Check and implement several cross validation techniques for evaluating the recommenders.

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cross_validation.py
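For reference, the simplest of these techniques, k-fold splitting, can be sketched in a few lines of plain Python. This is an illustrative sketch, not the scikit-learn implementation linked above; `k_fold` is a hypothetical helper name:

```python
def k_fold(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold(5, 2))
```

For a recommender, the indices would typically refer to individual (user, item, rating) preferences rather than to whole users.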

Develop the Item-Based Collaborative Filtering Recommender

Base Articles:

Item-based collaborative filtering recommendation algorithms by Sarwar
http://portal.acm.org/citation.cfm?id=372071

Amazon.com Item-to-Item collaborative Filtering

http://www.disco.ethz.ch/lectures/fs10/seminar/paper/michael-2.pdf

Include self.__set_params in the recommend BaseRecommender method

It is necessary to set the params in the recommend BaseRecommender method by calling self.__set_params(**params) from BaseEstimator.

How can this be done in the BaseRecommender without calling super in the child class?
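One possible answer is the template-method pattern: the base class owns `recommend`, sets the parameters, and delegates to a hook that subclasses override. In the sketch below `_set_params` is a stand-in written for this example (it is not the scikit-learn method), and `_recommend` and `EchoRecommender` are hypothetical names. A single underscore is used because double-underscore names are mangled per class in Python:

```python
class BaseRecommender(object):
    def _set_params(self, **params):
        # Stand-in for the parameter setter inherited from BaseEstimator.
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def recommend(self, user_id, how_many=None, **params):
        # Shared bookkeeping happens once, here, for every subclass.
        self._set_params(**params)
        return self._recommend(user_id, how_many)

    def _recommend(self, user_id, how_many):
        raise NotImplementedError("subclasses must implement _recommend")


class EchoRecommender(BaseRecommender):
    # Hypothetical subclass: it never calls super() or _set_params itself.
    def _recommend(self, user_id, how_many):
        return [user_id] * (how_many or 1)


rec = EchoRecommender()
result = rec.recommend('u1', 2, capper=False)
```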

Create the FileDataModel

Create the FileDataModel, which will receive a *.txt, *.csv, or any other text file as input, parse it, and store it in an internal structure.

Use a sample database to test the FileDataModel.

Work with the developer assigned to the DictDataModel to choose the best internal structure for storing the user rating/preference matrix.
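A minimal parsing sketch, assuming one `user_id,item_id,rating` record per line and `#` for comment lines. Both the file layout and the `load_preferences` name are assumptions; the issue does not fix a format:

```python
import csv
import io


def load_preferences(handle, delimiter=','):
    """Parse `user_id,item_id,rating` rows into {user_id: {item_id: rating}}."""
    prefs = {}
    for row in csv.reader(handle, delimiter=delimiter):
        if not row or row[0].startswith('#'):
            continue  # skip blank lines and comments
        user_id, item_id, rating = row[0].strip(), row[1].strip(), float(row[2])
        prefs.setdefault(user_id, {})[item_id] = rating
    return prefs


sample = io.StringIO(u"# user,item,rating\nmarcel,apple,2.0\nmarcel,celery,1.0\nbruno,orange,2.0\n")
prefs = load_preferences(sample)
```

The resulting nested dict has the same shape as the DictDataModel input, so the two models could share one internal structure.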

Develop the Item-Based Similarity

ItemSimilarity (extends Similarity)

Returns the degree of similarity between two items, based on the preferences users have expressed for them.
Implementations of this class define a notion of similarity between two items.
Implementations should return values in the range 0.0 to 1.0, with 1.0 representing
perfect similarity.

most_similar_users does not work properly with the Boolean model

The method most_similar_users does not work properly with boolean models: when there is no similarity, the boolean model returns 0.0, whereas the ratings model returns np.nan.

def most_similar_users(self, user_id, how_many=None):
    '''
    Return the most similar users to the given user, ordered
    from most similar to least.

    Parameters
    -----------
    ....
    '''
    ....
    # Skip the user itself, NaN similarities (ratings models)
    # and 0.0 similarities (boolean models).
    return np.array([to_user_id for to_user_id, pref in similarities
                     if user_id != to_user_id and not np.isnan(pref) and pref != 0.0])

Develop the MatrixBooleanPreferenceDataModel

Develop an extended class of MatrixPreferenceDataModel that will support boolean matrices (presence or absence of the item).

Create a module to manage the datasets

Check the 'scikits.learn' datasets Bunch to build a specific module for use with the recommender datasets.

Create a MemoryModel

A memory-based model that extends BaseModel.

Take care that the data structure used is as optimized as possible.

Investigate the optimization of the metrics_pairwise.py

Investigate the optimization of metrics_pairwise.py. Check if it is the best solution for the metrics.

Create a BaseModel

Create a BaseModel:

Follow these conventions:

def UserIDs(self):
    '''
    Return all user IDs in the model, in order
    '''
    raise NotImplementedError("cannot instantiate Abstract Base Class")

def PreferencesFromUser(self,userID,orderByID=True):
    '''
    Return user's preferences, ordered by item ID (if orderByID is True)
    or by the preference values (if orderByID is False), as an array.
    '''
    raise NotImplementedError("cannot instantiate Abstract Base Class")

def ItemIDsFromUser(self,userID):
    '''
    Return IDs of items user expresses a preference for 
    '''
    raise NotImplementedError("cannot instantiate Abstract Base Class")

def ItemIDs(self):
    '''
    Return an iterator of all item IDs in the model, in order
    '''
    raise NotImplementedError("cannot instantiate Abstract Base Class")

def PreferencesForItem(self,itemID,orderByID=True):
    '''
    Return all existing Preferences expressed for that item, 
    ordered by user ID (if orderByID is True) or by the preference values 
    (if orderByID is False), as an array.
    '''
    raise NotImplementedError("cannot instantiate Abstract Base Class")

def PreferenceValue(self,userID,itemID):
    '''
    Retrieves the preference value for a single user and item.
    '''
    raise NotImplementedError("cannot instantiate Abstract Base Class")

def PreferenceTime(self,userID,itemID):
    '''
    Retrieves the time at which a preference value from a user and item was set, if known.
    Time is expressed in the usual way, as a number of milliseconds since the epoch.
    '''
    raise NotImplementedError("cannot instantiate Abstract Base Class")

def NumUsers(self):
    '''
    Return total number of users known to the model.
    '''
    raise NotImplementedError("cannot instantiate Abstract Base Class")

def NumItems(self):
    '''
    Return total number of items known to the model.
    '''
    raise NotImplementedError("cannot instantiate Abstract Base Class")

def NumUsersWithPreferenceFor(self,*itemIDs):
    '''
    Return the number of users who have expressed a preference for all of the items
    '''
    raise NotImplementedError("cannot instantiate Abstract Base Class")

def setPreference(self,userID,itemID,value):
    '''
    Sets a particular preference (item plus rating) for a user.
    '''
    raise NotImplementedError("cannot instantiate Abstract Base Class")

def removePreference(self,userID, itemID):
    '''
    Removes a particular preference for a user.
    '''
    raise NotImplementedError("cannot instantiate Abstract Base Class")

def convertItemID2name(self, itemID):
    """Given item id number return item name"""
    raise NotImplementedError("cannot instantiate Abstract Base Class")

def convertUserID2name(self, userID):
    """Given user id number return user name"""
    raise NotImplementedError("cannot instantiate Abstract Base Class") 


def hasPreferenceValues(self):
    '''
    Return True if this implementation is not a 'boolean' DataModel,
    that is, if it stores actual preference values. Otherwise return False.
    '''
    raise NotImplementedError("cannot instantiate Abstract Base Class")


def MaxPreference(self):
    '''
    Return the maximum preference value that is possible in the current problem domain being evaluated.
    For example, if the domain is movie ratings on a scale of 1 to 5, this should be 5. While  a recommender
    may estimate a preference value above 5.0, it isn't "fair" to consider that the system is actually
    suggesting an impossible rating of, say, 5.4 stars.
    In practice the application would cap this estimate to 5.0. Since evaluators evaluate
    the difference between estimated and actual value, this at least prevents this effect from unfairly
    penalizing a Recommender.
    '''
    raise NotImplementedError("cannot instantiate Abstract Base Class")


def MinPreference(self):
    '''
    Returns the minimum preference value that is possible in the current problem domain being evaluated
    '''
    raise NotImplementedError("cannot instantiate Abstract Base Class")

Change the name of the repository and the name of the organization

We will move the project to a repository with a better name. The muricoca labs name will be changed since we will have a new partner in the project.

We will also change the name of the organization to a better one that reflects the new committers to this project.

Return of Recommenders with with_preference = True is all strings.

When the Recommenders' methods that compute recommendations or similarities are called, every value in the result comes back as a string.

For instance:

    #1. Catch the nearest neighbors based on Friends.
    neighborhood = NearestNeighborsStrategy()

    #2. Followers Similarity
    model = MongoBooleanUserDataModel('followers')
    similarity = UserSimilarity(model, jaccard_coefficient)

    f_model = MongoBooleanUserDataModel('friends')

    #3. Recommend Friends based on Friends Similarity.
    recsys_followers = SocialBasedRecommender(f_model, similarity, neighborhood)

    user_id = 'ricardocaspirro'
    assert_array_equal(np.array([[u'luciananunes', u'0.5'], [u'brunomelo', u'0.5']]),
                recsys_followers.recommend(user_id, 4))

As the expected array above shows, the scores are returned as strings ('0.5', '0.5'); the correct output would have them as float values.
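A likely cause, offered here as an assumption since the recommender code is not shown, is NumPy's type promotion: putting string IDs and float scores in one plain ndarray converts everything to strings. The sketch below reproduces the effect and shows one possible alternative, a structured array with hypothetical field names `user_id` and `score`:

```python
import numpy as np

# Mixing str and float in one plain ndarray promotes every element to a string.
mixed = np.array([['luciananunes', 0.5], ['brunomelo', 0.5]])

# A structured array keeps one dtype per field, so the scores stay floats.
recs = np.array([('luciananunes', 0.5), ('brunomelo', 0.5)],
                dtype=[('user_id', 'U32'), ('score', 'f8')])
```

Returning a list of `(id, score)` tuples would avoid the promotion as well.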

Remove the dependency on Scikit.Learn

The reason is that installing scikit-learn as a dependency causes several issues.

It will be removed and offered instead as an optional plugin integration.

_top_matches in Recommenders with wrong values at preferences

In _top_matches, preferences is using the wrong value. It needs to be changed to the correct one.

def _top_matches(self, source_id, target_ids, how_many=None, **params):
    ....
    #Empty target_ids
    if target_ids.size == 0:
        return np.array([])

    estimate_preferences = np.vectorize(self.estimate_preference)

    preferences = estimate_preferences(source_id, target_ids)

    preferences_values = preferences[~np.isnan(preferences)]

    target_ids = target_ids[~np.isnan(preferences)]

Develop the Matrix Factorization Recommender System

Based on the papers by Sarwar et al.:

Sarwar, B.; Karypis, G.; Konstan, J.; Riedl, J. (2000), Application of Dimensionality Reduction in Recommender System: A Case Study

and

Incremental Singular Value Decomposition Algorithms for Highly Scalable Recommender Systems (Sarwar et al.)
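To make the idea concrete, here is a rough NumPy sketch of a truncated-SVD prediction. It is a simplification for illustration, not a faithful reproduction of either paper: filling gaps with column (item) means and centering on row (user) means are assumptions of this sketch, and `svd_predict` is a hypothetical name:

```python
import numpy as np


def svd_predict(ratings, k):
    """Rank-k reconstruction of a ratings matrix with NaN for missing entries."""
    ratings = np.asarray(ratings, dtype=float)
    missing = np.isnan(ratings)
    # Fill gaps with each item's (column) mean so the SVD sees a dense matrix.
    # Assumes every item has at least one rating.
    col_means = np.nanmean(ratings, axis=0)
    filled = np.where(missing, col_means, ratings)
    # Center on each user's (row) mean before decomposing.
    row_means = filled.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(filled - row_means, full_matrices=False)
    # Keep only the k largest singular values and their vectors.
    approx = np.dot(U[:, :k] * s[:k], Vt[:k, :])
    return approx + row_means
```

Smaller `k` gives a smoother, lower-rank estimate; with `k` equal to the full rank the known ratings are reproduced exactly.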

Investigate the implementation of the Graph Based Recommender

Interesting approach for recommenders using graph theory:

http://markorodriguez.com/2011/09/22/a-graph-based-movie-recommender-engine/

Develop the ContentSimilarity

It will use the open-source library:

https://github.com/piskvorky/gensim

as background for generating the similarities.

Develop the MongoDB/MySQL Data Model

Develop a MongoDB/MySQL DataModel (which of the two will be used is still to be decided).

Problems at printing the model

When the model is loaded with user_ids and item_ids as integers, the repr method throws an Exception.

It only works when the models use strings as IDs.

Implement the representation as string (repr) of the DataModels

The current data models lack a string representation via repr.

The goal is that print model shows a brief representation of the ratings matrix (based on the current ones), such as:

print model
MatrixDataModel (3 by 3)
          red       orange    green
apple     2.000000  ---       1.000000
orange    ---       2.000000  ---
celery    ---       ---       1.000000
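A possible shape for such a `__repr__`, sketched on a hypothetical stand-in class (`TinyMatrixModel`, holding a dict of dicts) rather than on Crab's real data models. Calling `str()` on every ID lets integer IDs print as well as string ones:

```python
class TinyMatrixModel(object):
    """Hypothetical stand-in for a data model; only __repr__ is sketched."""

    def __init__(self, dataset):
        self.dataset = dataset  # {user_id: {item_id: rating}}
        self.user_ids = sorted(dataset, key=str)
        self.item_ids = sorted(set(i for prefs in dataset.values() for i in prefs), key=str)

    def __repr__(self):
        width = 10
        # Header line: class name and matrix shape.
        lines = ['%s (%d by %d)' % (type(self).__name__, len(self.user_ids), len(self.item_ids))]
        # Column labels, padded to a fixed width.
        lines.append(''.ljust(width) + ''.join(str(i).ljust(width) for i in self.item_ids))
        for user_id in self.user_ids:
            row = self.dataset[user_id]
            # Missing preferences are shown as '---'.
            cells = ['%f' % row[i] if i in row else '---' for i in self.item_ids]
            lines.append(str(user_id).ljust(width) + ''.join(c.ljust(width) for c in cells))
        return '\n'.join(lines)


text = repr(TinyMatrixModel({1: {10: 2.0, 30: 1.0}, 2: {20: 2.0}}))
```

For large models the output would need truncating to the first few rows and columns to stay brief.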

Decouple the Neighborhood in Recommenders

The Neighborhood strategies are too tightly coupled, so you can't easily use different models.

Develop the SparseMatrixPreferenceDataModel

Develop the Sparse Matrix Representation extending the current MatrixPreferenceDataModel using the package:

http://docs.scipy.org/doc/scipy/reference/sparse.html

PS: you should take a look at this:

https://github.com/piskvorky/gensim/blob/develop/gensim/similarities/docsim.py

Estimate_preference is not working properly with boolean models

Estimate_preference is not working properly with boolean models.

def estimate_preference(self, user_id, item_id, **params):

    preference = self.model.preference_value(user_id, item_id)

    if not np.isnan(preference) and preference != 0.0:
        return preference

With boolean models, preference comes back as 0.0 and is therefore treated as a non-preference.

Optimizations at Crab - Similarity and Model

Optimization to work with medium-size datasets in memory.

Implement the User Based Similarity

UserSimilarity that extends Similarity interface:

Returns the degree of similarity between two users, based on their preferences.
Implementations of this class define a notion of similarity between two users. 
Implementations should  return values in the range 0.0 to 1.0, with 1.0 representing 
perfect similarity.

Build the BaseRecommender

Build the BaseRecommender that will extend the BaseEstimator from Scikit.learn

This module will hold the main methods for recommendation process.

def recommend(self,userID,howMany,rescorer=None)

def estimatePreference(self,**args)

def allOtherItems(self,userID)

def setPreference(self,userID,itemID,value)

def removePreference(self,userID,itemID)

Investigate the main methods to be placed in this module.

Develop the MatrixPreferenceDataModel

Investigate how to use a matrix representation with SciPy and NumPy instead of a dict.
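A small sketch of the conversion, assuming the dict-of-dicts input shape and NaN as the marker for a missing preference. `to_matrix` is a hypothetical helper, not an existing Crab function:

```python
import numpy as np


def to_matrix(dataset):
    """Convert {user_id: {item_id: rating}} into (user_ids, item_ids, matrix)."""
    user_ids = sorted(dataset)
    item_ids = sorted(set(i for prefs in dataset.values() for i in prefs))
    # Map each item ID to its column index.
    col = dict((item_id, j) for j, item_id in enumerate(item_ids))
    # NaN marks "no preference", so 0.0 remains a valid rating.
    matrix = np.full((len(user_ids), len(item_ids)), np.nan)
    for r, user_id in enumerate(user_ids):
        for item_id, rating in dataset[user_id].items():
            matrix[r, col[item_id]] = rating
    return user_ids, item_ids, matrix


users, items, m = to_matrix({'b': {'x': 1.0}, 'a': {'x': 2.0, 'y': 3.0}})
```

Keeping the sorted ID lists alongside the matrix lets row and column indices be translated back to IDs.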

Update the Write-Docs Wiki Page

It is necessary to update the Write-Docs Wiki Page.

https://github.com/muricoca/crab/wiki/How-to-write-docs

In metrics/test_pairwise.py remove warnings

Fix the warnings in the test_pairwise.py (Investigate).

Marcel-2:tests marcelcaraciolo$ python test_pairwise.py
/Users/marcelcaraciolo/Desktop/muricoca/crab/crab/scikits/crab/metrics/pairwise.py:425: RuntimeWarning: invalid value encountered in divide
return np.dot(X,Y.T) / np.dot(np.sqrt(XX) ,np.sqrt(YY).T)
...../Users/marcelcaraciolo/Desktop/muricoca/crab/crab/scikits/crab/metrics/pairwise.py:147: RuntimeWarning: invalid value encountered in divide
return np.divide(num,den)
....
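These warnings are consistent with a 0/0 division, for example when one of the vectors has zero norm; that diagnosis is an assumption, since the test inputs are not shown. One way to silence them without hiding real problems is to divide only where the denominator is non-zero. `safe_cosine` below is a hypothetical helper, not the function in pairwise.py:

```python
import numpy as np


def safe_cosine(X, Y):
    """Cosine similarity between rows of X and rows of Y, NaN for zero-norm rows."""
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    num = np.dot(X, Y.T)
    den = np.outer(np.sqrt((X ** 2).sum(axis=1)), np.sqrt((Y ** 2).sum(axis=1)))
    # Divide only where the denominator is non-zero; the other cells keep
    # their initial NaN and no RuntimeWarning is raised for them.
    out = np.full(num.shape, np.nan)
    np.divide(num, den, out=out, where=den != 0)
    return out
```

Wrapping the original division in `np.errstate(invalid='ignore')` would be a smaller change that yields NaN in the same cells.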

Fix the execution of all tests in crab

There is an issue related to the execution of the tests that is not recognizing the imports of the main classes.

Support opening zip and gzip files on FileDataModel

It is important for FileDataModel to support opening zip and gzip files.
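A sketch of how the opening step might dispatch on the file extension using only the standard library. The `open_text` name, the UTF-8 encoding and the "first member of the archive" rule for zip files are assumptions:

```python
import gzip
import io
import zipfile


def open_text(path):
    """Open a plain, .gz or .zip file and return a text-mode file object."""
    if path.endswith('.gz'):
        return io.TextIOWrapper(gzip.open(path, 'rb'), encoding='utf-8')
    if path.endswith('.zip'):
        archive = zipfile.ZipFile(path)
        # Assumption: the archive holds a single data file; read the first member.
        member = archive.namelist()[0]
        return io.TextIOWrapper(archive.open(member), encoding='utf-8')
    return io.open(path, 'r', encoding='utf-8')
```

The parser can then read lines from the returned object regardless of how the file was compressed.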

Include the MovieLens dataset to the project

Include the MovieLens in datasets.

http://www.grouplens.org/node/73

Investigate the SVD Recommender with users with no preferences returns NaN

Investigate whether the Matrix Factorization (SVD) recommender should recommend something or nothing for a user with no preferences. Check what the paper says about this case.

Create the DictDataModel

Develop the DictDataModel that extends the BaseDataModel.

This will store all the inputs that come as a dictionary

{'user_id': {'item_id': Rate, 'item_id2': Rate3, ...}, 'user_id2': {...}}

in the Data Model in memory.

Investigate the best way to store in memory all the user x user matrix data.

Create the BaseSimilarity

A BaseSimilarity interface for using similarity between items or users.

class Similarity(object):
    """
    Similarity Class - for similarity searches over a set of items/users.

    In all instances, there is a data model against which we want to perform the
    similarity search.

    For each similarity search, the input is an item/user and the output are its
    similarities to individual items/users.

    Similarity queries are realized by calling ``self[query_item]``.
    There is also a convenience wrapper, where iterating over `self` yields
    similarities of each object in the model against the whole data model (ie.,
    the query is each item/user in turn).
    """

    def __init__(self, model, distance, numBest=None):
        """ The constructor of Similarity class

        `model` defines the data model where data is fetched.

        `distance` The similarity measured (function) between two vectors.

        If `numBest` is left unspecified, similarity queries return a full list (one
        float for every item in the model, including the query item).

        If `numBest` is set, queries return `numBest` most similar items, as a
        sorted list.
        """
        self.model = model
        self.distance = distance
        self.numBest = numBest

    def getSimilarity(self, vec1, vec2):
        """
        Return similarity of a vector `vec1` to a specific vector `vec2` in the model.
        The vector is assumed to be either of unit length or empty.
        """
        raise NotImplementedError("cannot instantiate Abstract Base Class")

    def getSimilarities(self, vec):
        """
        Return similarity of a vector `vec` to all vectors in the model.
        The vector is assumed to be either of unit length or empty.
        """
        raise NotImplementedError("cannot instantiate Abstract Base Class")

    def __getitem__(self, vec):
        """
        Get similarities of a vector `vec` to all items in the model
        """
        allSims = self.getSimilarities(vec)

        # return either all similarities as a list, or only self.numBest
        # most similar, depending on settings from the constructor
        if self.numBest is None:
            return allSims
        else:
            tops = [(label, sim) for label, sim in allSims]
            tops = sorted(tops, key=lambda item: -item[1])  # sort by -sim => highest sim first
            return tops[:self.numBest]  # return at most numBest top 2-tuples (label, sim)

Add Recommender Evaluator

Add the implementation of Recommender Evaluator which will evaluate the recommender metrics using the cross-validation functions available for the data set.

Print the recommendations with the strings

Currently we use the user IDs and item IDs when showing the recommendations.
The idea is to develop a method that shows the string corresponding to each user_id and item_id.

This task can be done using a method in the model that can show those strings.

def userid2string(self, user_id):

def item_id2string(self, item_id):

error in scikits.crab setup command: Distribution contains no modules or packages for namespace package 'scikits'

My installation attempts included

pip install git+git://github.com/muricoca/crab.git

and

pip install -e git+git://github.com/muricoca/crab.git#egg=crab

Develop the Home-Page for the Crab Project

Develop the new home page for the Crab Project.

It will be hosted using muricoca.github.com/crab and crab.muricoca.com