muricoca / crab

Crab is a flexible, fast recommender engine for Python that integrates classic information filtering recommendation algorithms in the world of scientific Python packages (numpy, scipy, matplotlib).

Home Page: http://muricoca.github.com/crab

License: Other

Shell 1.35% Python 98.65%

crab's People

Contributors

bradfora, brunojm, chrisgilmerproj, douglas, earl, fcurella, marcelcaraciolo

crab's Issues

Develop the DictBooleanPreferenceDataModel

Develop the data model that supports boolean preferences instead of storing a value for each preference for an item ID.

This model represents the presence or absence of the item ID; that is, the association can have one of two values: exists, or doesn't exist.

{ user_id: [ item_id, item_id2, ..] , user_id2: [item_id1, item_id2, item_id3, ...] }
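The structure above can be sketched with plain dictionaries. This is only an illustration, not the project's implementation: the helper names `to_boolean_preferences` and `has_preference` and the sample IDs are hypothetical.

```python
def to_boolean_preferences(ratings):
    """Drop the rating values and keep only which items each user is linked to."""
    return {user_id: sorted(items) for user_id, items in ratings.items()}


def has_preference(boolean_prefs, user_id, item_id):
    """A boolean preference either exists or it does not; there is no value."""
    return item_id in boolean_prefs.get(user_id, [])


# Hypothetical sample data in the rated {user_id: {item_id: value}} form.
ratings = {
    "user_a": {"item_1": 4.0, "item_2": 2.5},
    "user_b": {"item_2": 5.0},
}
boolean_prefs = to_boolean_preferences(ratings)
```

Storing a set per user instead of a list would make the membership test cheaper, at the cost of losing the list form shown in the issue.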

Develop the User-Based Collaborative Filtering Recommender

Develop the user-based collaborative filtering recommender using k-nearest neighbors (KNN).

Put the references here:

Herlocker, J., Konstan, J., Borchers, A., & Riedl, J. An algorithmic framework for performing collaborative filtering. In Proceedings of the international ACM SIGIR conference on research and development in information retrieval, Berkeley, California. New York: ACM.

Fix most_similar_user_ids to not fetch nan values in similarities.

To add:

def most_similar_users(self, user_id, how_many=None):
    '''
    Return the most similar users to the given user, ordered
    from most similar to least.

    Parameters
    -----------
    ...


    return np.array([to_user_id for to_user_id, pref in similarities
        if user_id != to_user_id and not np.isnan(pref)])
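The filter in the return statement can be tried in isolation. The sketch below assumes `similarities` is an iterable of `(user_id, similarity)` pairs already ordered from most to least similar; the function name `drop_nan_similarities` and the sample data are made up for illustration.

```python
import numpy as np


def drop_nan_similarities(user_id, similarities):
    """Keep neighbor IDs, skipping the user itself and any NaN similarity."""
    return np.array([to_user_id for to_user_id, pref in similarities
                     if user_id != to_user_id and not np.isnan(pref)])


# Hypothetical, already sorted (user_id, similarity) pairs.
similarities = [("ana", 1.0), ("bob", 0.8), ("carl", float("nan")), ("dee", 0.3)]
result = drop_nan_similarities("ana", similarities)
```

`np.isnan` is needed here because NaN never compares equal to anything, so a check such as `pref != np.nan` would let every value through.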

Create the FileDataModel

Create the FileDataModel, which receives a *.txt, *.csv, or any other text file as input, parses it, and stores it in an internal structure.

Use a sample database to test the FileDataModel.

Work with the developer assigned to the DictDataModel to choose the best internal structure for storing the user ratings/preferences matrix.
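A minimal sketch of the parsing step, assuming each line has the form `user_id,item_id,value` (the issue does not specify the file layout, so this layout, the `load_preferences` name, and the sample rows are assumptions):

```python
import csv
import io


def load_preferences(lines, delimiter=","):
    """Parse 'user_id,item_id,value' rows into {user_id: {item_id: value}}."""
    prefs = {}
    for row in csv.reader(lines, delimiter=delimiter):
        # Skip blank lines and comment lines starting with '#'.
        if not row or row[0].startswith("#"):
            continue
        user_id, item_id = row[0].strip(), row[1].strip()
        prefs.setdefault(user_id, {})[item_id] = float(row[2])
    return prefs


# A small in-memory stand-in for a sample file.
sample = io.StringIO("# user,item,rating\nu1,apple,2.0\nu1,celery,1.0\n\nu2,apple,4.5\n")
prefs = load_preferences(sample)
```

Because `csv.reader` accepts any iterable of lines, the same function works with an open file object, for example `load_preferences(open(path))`.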

Develop the Item-Based Similarity

ItemSimilarity (that extends Similarity)

Returns the degree of similarity of two items, based on the preferences users have expressed for them.
Implementations of this class define a notion of similarity between two items. 
Implementations should  return values in the range 0.0 to 1.0, with 1.0 representing 
perfect similarity.

most_similar_users does not work properly with the Boolean Model

The method most_similar_users does not work properly with boolean models: when there is no similarity, the boolean model's similarity returns 0.0, whereas the ratings model returns np.nan.

def most_similar_users(self, user_id, how_many=None):
    '''
    Return the most similar users to the given user, ordered
    from most similar to least.

    Parameters
    -----------
    ....
    return np.array([to_user_id for to_user_id, pref in similarities
        if user_id != to_user_id and not np.isnan(pref) and pref != 0.0])

Create a MemoryModel

An in-memory model that extends BaseModel.

Take care to choose a data structure that is as optimized as possible.

Create a BaseModel

Create a BaseModel:

Follow these conventions:

def UserIDs(self):
    '''
    Return all user IDs in the model, in order
    '''
    raise NotImplementedError("cannot instantiate Abstract Base Class")

def PreferencesFromUser(self,userID,orderByID=True):
    '''
    Return user's preferences, ordered by item ID (if orderByID is True) 
    or by the preference values (if orderByID is False), as an array.
    '''
    raise NotImplementedError("cannot instantiate Abstract Base Class")

def ItemIDsFromUser(self,userID):
    '''
    Return IDs of items user expresses a preference for 
    '''
    raise NotImplementedError("cannot instantiate Abstract Base Class")

def ItemIDs(self):
    '''
    Return an iterator of all item IDs in the model, in order
    '''
    raise NotImplementedError("cannot instantiate Abstract Base Class")

def PreferencesForItem(self,itemID,orderByID=True):
    '''
    Return all existing Preferences expressed for that item, 
    ordered by user ID (if orderByID is True) or by the preference values 
    (if orderByID is False), as an array.
    '''
    raise NotImplementedError("cannot instantiate Abstract Base Class")

def PreferenceValue(self,userID,itemID):
    '''
    Retrieves the preference value for a single user and item.
    '''
    raise NotImplementedError("cannot instantiate Abstract Base Class")

def PreferenceTime(self,userID,itemID):
    '''
    Retrieves the time at which a preference value from a user and item was set, if known.
    Time is expressed in the usual way, as a number of milliseconds since the epoch.
    '''
    raise NotImplementedError("cannot instantiate Abstract Base Class")

def NumUsers(self):
    '''
    Return total number of users known to the model.
    '''
    raise NotImplementedError("cannot instantiate Abstract Base Class")

def NumItems(self):
    '''
    Return total number of items known to the model.
    '''
    raise NotImplementedError("cannot instantiate Abstract Base Class")

def NumUsersWithPreferenceFor(self,*itemIDs):
    '''
    Return the number of users who have expressed a preference for all of the items
    '''
    raise NotImplementedError("cannot instantiate Abstract Base Class")

def setPreference(self,userID,itemID,value):
    '''
    Sets a particular preference (item plus rating) for a user.
    '''
    raise NotImplementedError("cannot instantiate Abstract Base Class")

def removePreference(self,userID, itemID):
    '''
    Removes a particular preference for a user.
    '''
    raise NotImplementedError("cannot instantiate Abstract Base Class")

def convertItemID2name(self, itemID):
    """Given item id number return item name"""
    raise NotImplementedError("cannot instantiate Abstract Base Class")

def convertUserID2name(self, userID):
    """Given user id number return user name"""
    raise NotImplementedError("cannot instantiate Abstract Base Class") 


def hasPreferenceValues(self):
    '''
    Return True if this implementation actually stores preference values,
    that is, if it is not a 'boolean' DataModel. Otherwise return False.
    '''
    raise NotImplementedError("cannot instantiate Abstract Base Class")


def MaxPreference(self):
    '''
    Return the maximum preference value that is possible in the current problem domain being evaluated.
    For example, if the domain is movie ratings on a scale of 1 to 5, this should be 5. While  a recommender
    may estimate a preference value above 5.0, it isn't "fair" to consider that the system is actually
    suggesting an impossible rating of, say, 5.4 stars.
    In practice the application would cap this estimate to 5.0. Since evaluators evaluate
    the difference between estimated and actual value, this at least prevents this effect from unfairly
    penalizing a Recommender.
    '''
    raise NotImplementedError("cannot instantiate Abstract Base Class")


def MinPreference(self):
    '''
    Returns the minimum preference value that is possible in the current problem domain being evaluated
    '''
    raise NotImplementedError("cannot instantiate Abstract Base Class")

Change the name of the repository and the name of the organization

We will move the project to a repository with a better name. The muricoca labs name will change because a new partner is joining the project.

We will also rename the organization to better reflect the new committers to this project.

Recommenders with with_preference = True return all values as strings

When the Recommenders' methods that compute recommendations or similarities are called, every returned value is a string.

For instance:

# 1. Get the nearest neighbors based on Friends.
neighborhood = NearestNeighborsStrategy()

# 2. Followers Similarity
model = MongoBooleanUserDataModel('followers')
similarity = UserSimilarity(model, jaccard_coefficient)

f_model = MongoBooleanUserDataModel('friends')

# 3. Recommend Friends based on Friends Similarity.
recsys_followers = SocialBasedRecommender(f_model, similarity, neighborhood)

user_id = 'ricardocaspirro'
assert_array_equal(np.array([[u'luciananunes', u'0.5'], [u'brunomelo', u'0.5']]),
                   recsys_followers.recommend(user_id, 4))

As shown above, the scores come back as strings (u'0.5', u'0.5'); the correct output would return them as float values.
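One plausible cause, which the report does not confirm, is NumPy's dtype coercion: a single array built from both IDs and scores gets a common string dtype. The sketch below reproduces that behaviour with the values from the example and shows two ways to keep the scores as floats; the field names `user_id` and `score` are invented for illustration.

```python
import numpy as np

# Mixing strings and floats in one array coerces everything to a string dtype.
mixed = np.array([["luciananunes", 0.5], ["brunomelo", 0.5]])

# Option 1: keep IDs and scores in separate arrays and pair them on return.
ids = np.array(["luciananunes", "brunomelo"])
scores = np.array([0.5, 0.5])
pairs = list(zip(ids.tolist(), scores.tolist()))

# Option 2: a structured array with one typed field per column.
structured = np.array(
    [("luciananunes", 0.5), ("brunomelo", 0.5)],
    dtype=[("user_id", "U32"), ("score", "f8")],
)
```

Either option changes the return shape that callers see, so the existing tests comparing against a 2-D string array would need updating along with the fix.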

Remove the dependency on Scikit.Learn

The reason is that installing scikit-learn as a dependency is causing several issues.

It will be removed and offered as an optional plugin integration instead.

_top_matches in Recommenders uses wrong values for preferences

In _top_matches, the preferences use the wrong values. They need to be changed to the correct ones.

def _top_matches(self, source_id, target_ids, how_many=None, **params):
   ....
    #Empty target_ids
    if target_ids.size == 0:
        return np.array([])

    estimate_preferences = np.vectorize(self.estimate_preference)

    preferences = estimate_preferences(source_id, target_ids)

    preferences_values = preferences[~np.isnan(preferences)]

    target_ids = target_ids[~np.isnan(preferences)]
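The filtering step above can be sketched as a standalone function. This is an assumption-laden illustration rather than the project's code: it takes precomputed preferences instead of calling `estimate_preference`, and the descending sort and `how_many` cut are guesses at what the elided remainder does.

```python
import numpy as np


def top_matches(target_ids, preferences, how_many=None):
    """Drop NaN estimates, then return IDs and values sorted best first."""
    target_ids = np.asarray(target_ids)
    preferences = np.asarray(preferences, dtype=float)
    if target_ids.size == 0:
        return np.array([]), np.array([])

    # Compute the mask once so IDs and values are filtered identically.
    valid = ~np.isnan(preferences)
    ids, values = target_ids[valid], preferences[valid]

    # A stable sort on the negated values keeps ties in their original order.
    order = np.argsort(-values, kind="stable")
    if how_many is not None:
        order = order[:how_many]
    return ids[order], values[order]


ids, values = top_matches(["a", "b", "c", "d"], [3.0, np.nan, 4.5, 1.0], how_many=2)
```

Keeping the filtered values in their own variable, as the snippet in the issue does with `preferences_values`, avoids indexing the already shortened array with the original full-length mask.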

Develop the Matrix Factorization Recommender System

Based on the papers by Sarwar et al.:

Sarwar, B.; Karypis, G.; Konstan, J.; Riedl, J. (2000), Application of Dimensionality Reduction in Recommender System: A Case Study

and

Incremental Singular Value Decomposition Algorithms for Highly Scalable Recommender Systems (Sarwar et al.)

Problems at printing the model

When the model is loaded with user_ids and item_ids as integers, the __repr__ method throws an exception.

It only works when the IDs are strings.

Implement the representation as string (__repr__) of the DataModels

The current data models lack a string representation via __repr__.

The goal is that print model shows a brief representation of the ratings matrix (based on the current ones), such as:

print model
MatrixDataModel (3 by 3)
           red        orange     green
apple      2.000000   ---        1.000000
orange     ---        2.000000   ---
celery     ---        ---        1.000000
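A rough sketch of how such a representation could be produced from a nested dictionary. The function name `format_matrix`, the column width rule, and the alphabetical ordering of rows and columns are all assumptions; a real `__repr__` would call something like this on the model's own storage.

```python
def format_matrix(prefs, name="MatrixDataModel", missing="---"):
    """Render {row: {col: value}} as a fixed-width text table."""
    rows = sorted(prefs)
    cols = sorted({col for row in prefs.values() for col in row})
    # Size every column to the longest label, placeholder, or formatted value.
    labels = rows + cols + [missing, "%f" % 0.0]
    width = max(len(label) for label in labels) + 2

    def line(cells):
        return "".join(cell.ljust(width) for cell in cells).rstrip()

    out = ["%s (%d by %d)" % (name, len(rows), len(cols)), line([""] + cols)]
    for row in rows:
        cells = ["%f" % prefs[row][col] if col in prefs[row] else missing
                 for col in cols]
        out.append(line([row] + cells))
    return "\n".join(out)


prefs = {
    "apple": {"red": 2.0, "green": 1.0},
    "orange": {"orange": 2.0},
    "celery": {"green": 1.0},
}
text = format_matrix(prefs)
```

Formatting labels through `str(label)` before measuring them would also cover integer IDs, which a plain `len(label)` does not.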

Estimate_preference is not working properly with boolean models

Estimate_preference is not working properly with boolean models.

def estimate_preference(self, user_id, item_id, **params):

    preference = self.model.preference_value(user_id, item_id)

    if not np.isnan(preference) and preference != 0.0:
        return preference

For boolean models, preference comes back as 0.0, which must be treated as a non-preference.

Implement the User Based Similarity

UserSimilarity that extends Similarity interface:

Returns the degree of similarity of two users, based on their preferences.
Implementations of this class define a notion of similarity between two users. 
Implementations should  return values in the range 0.0 to 1.0, with 1.0 representing 
perfect similarity.

Build the BaseRecommender

Build the BaseRecommender that will extend the BaseEstimator from Scikit.learn

This module will hold the main methods for recommendation process.

def recommend(self,userID,howMany,rescorer=None)

def estimatePreference(self,**args)

def allOtherItems(self,userID)

def setPreference(self,userID,itemID,value)

def removePreference(self,userID,itemID)

Investigate the main methods to be placed in this module.

In metrics/test_pairwise.py remove warnings

Fix the warnings in the test_pairwise.py (Investigate).

Marcel-2:tests marcelcaraciolo$ python test_pairwise.py
/Users/marcelcaraciolo/Desktop/muricoca/crab/crab/scikits/crab/metrics/pairwise.py:425: RuntimeWarning: invalid value encountered in divide
return np.dot(X,Y.T) / np.dot(np.sqrt(XX) ,np.sqrt(YY).T)
...../Users/marcelcaraciolo/Desktop/muricoca/crab/crab/scikits/crab/metrics/pairwise.py:147: RuntimeWarning: invalid value encountered in divide
return np.divide(num,den)
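These warnings typically appear when a row is all zeros, so the denominator is 0 and NumPy computes 0/0. One possible way to silence them locally is `np.errstate`, sketched below on a simplified cosine similarity. Whether returning NaN for zero vectors is the desired behaviour is an open question in the issue, and this function is not the code in pairwise.py.

```python
import numpy as np


def cosine_similarity(X, Y):
    """Cosine similarity between the rows of X and the rows of Y."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    num = X @ Y.T
    den = np.outer(np.linalg.norm(X, axis=1), np.linalg.norm(Y, axis=1))
    # A zero-length row gives 0/0; keep the NaN result but skip the warning.
    with np.errstate(divide="ignore", invalid="ignore"):
        return num / den


sims = cosine_similarity([[1.0, 0.0], [0.0, 0.0]], [[1.0, 0.0]])
```

The context manager only affects the division inside the `with` block, so genuine floating-point problems elsewhere in the test run are still reported.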
....

Create the DictDataModel

Develop the DictDataModel that extends the BaseDataModel.

This will store, in an in-memory data model, all input that comes as a dictionary:

{'user_id': {'item_id': Rate, 'item_id2': Rate3, ...}, 'user_id2': {...}}

  • Investigate the best way to store in memory all the user x user matrix data.
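One candidate storage layout to weigh during that investigation is a dense NumPy matrix with NaN marking missing ratings, plus two ID lists for indexing. The sketch below converts the dictionary form into that layout; the name `to_matrix` and the sample data are hypothetical, and a sparse format may suit large data sets better.

```python
import numpy as np


def to_matrix(prefs):
    """Convert {user_id: {item_id: rating}} into (user_ids, item_ids, matrix)."""
    user_ids = sorted(prefs)
    item_ids = sorted({item for row in prefs.values() for item in row})
    column = {item_id: j for j, item_id in enumerate(item_ids)}

    # NaN marks "no preference", which keeps 0.0 available as a real rating.
    matrix = np.full((len(user_ids), len(item_ids)), np.nan)
    for i, user_id in enumerate(user_ids):
        for item_id, value in prefs[user_id].items():
            matrix[i, column[item_id]] = value
    return user_ids, item_ids, matrix


prefs = {"u2": {"b": 3.0}, "u1": {"a": 1.0, "b": 2.0}}
user_ids, item_ids, matrix = to_matrix(prefs)
```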

Create the BaseSimilarity

A BaseSimilarity interface for using similarity between items or users.

class Similarity(object):
    """
    Similarity Class - for similarity searches over a set of items/users.

    In all instances, there is a data model against which we want to perform the
    similarity search.

    For each similarity search, the input is an item/user and the output is its
    similarities to individual items/users.

    Similarity queries are realized by calling ``self[query_item]``.
    There is also a convenience wrapper, where iterating over `self` yields
    similarities of each object in the model against the whole data model (i.e.,
    the query is each item/user in turn).
    """

    def __init__(self, model, distance, numBest=None):
        """ The constructor of Similarity class

        `model` defines the data model where data is fetched.

        `distance` The similarity measure (function) between two vectors.

        If `numBest` is left unspecified, similarity queries return a full list (one
        float for every item in the model, including the query item).

        If `numBest` is set, queries return `numBest` most similar items, as a
        sorted list.
        """
        self.model = model
        self.distance = distance
        self.numBest = numBest

    def getSimilarity(self, vec1, vec2):
        """
        Return similarity of a vector `vec1` to a specific vector `vec2` in the model.
        The vector is assumed to be either of unit length or empty.
        """
        raise NotImplementedError("cannot instantiate Abstract Base Class")

    def getSimilarities(self, vec):
        """
        Return similarity of a vector `vec` to all vectors in the model.
        The vector is assumed to be either of unit length or empty.
        """
        raise NotImplementedError("cannot instantiate Abstract Base Class")

    def __getitem__(self, vec):
        """
        Get similarities of a vector `vec` to all items in the model
        """
        allSims = self.getSimilarities(vec)

        # Return either all similarities as a list, or only the self.numBest
        # most similar, depending on settings from the constructor.
        if self.numBest is None:
            return allSims
        else:
            tops = [(label, sim) for label, sim in allSims]
            tops = sorted(tops, key=lambda item: -item[1])  # sort by -sim => highest sim first
            return tops[:self.numBest]  # return at most numBest top 2-tuples (label, sim)
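To show how the interface might be used, here is a condensed restatement of the base class with a hypothetical concrete subclass. `DictSimilarity`, the `dot` function, and the assumption that the model is a plain `{label: vector}` dictionary are illustrative only and do not come from the project.

```python
import numpy as np


class Similarity(object):
    """Condensed version of the interface above, so the sketch runs on its own."""

    def __init__(self, model, distance, numBest=None):
        self.model = model
        self.distance = distance
        self.numBest = numBest

    def getSimilarities(self, vec):
        raise NotImplementedError("cannot instantiate Abstract Base Class")

    def __getitem__(self, vec):
        allSims = self.getSimilarities(vec)
        if self.numBest is None:
            return allSims
        return sorted(allSims, key=lambda item: -item[1])[:self.numBest]


class DictSimilarity(Similarity):
    """Hypothetical subclass: the model is a dict mapping label -> vector."""

    def getSimilarity(self, vec1, vec2):
        return self.distance(vec1, vec2)

    def getSimilarities(self, vec):
        return [(label, self.getSimilarity(vec, other))
                for label, other in self.model.items()]


def dot(vec1, vec2):
    """For unit-length vectors the dot product equals the cosine similarity."""
    return float(np.dot(vec1, vec2))


model = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.6, 0.8]}
sim = DictSimilarity(model, dot, numBest=2)
top = sim[[1.0, 0.0]]
```

With `numBest=None` the same query returns one `(label, similarity)` pair per entry in the model, in the model's own order.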

Add Recommender Evaluator

Add the implementation of Recommender Evaluator which will evaluate the recommender metrics using the cross-validation functions available for the data set.

Print the recommendations with the strings

Currently the recommendations are shown using the user IDs and item IDs.
The idea is to develop a method that shows the string corresponding to each user_id and item_id.

This can be done with methods in the model that return those strings.

def userid2string(self, user_id):

def item_id2string(self, item_id):

Merging Ocelma python-recsys project into Crab

The goal here is to analyze how to merge the Ocelma python-recsys project into the current Crab. It will be a stand-alone project for both committers in order to implement new recommender algorithms.

Can't access 'usage' in the wiki

Clicking on the 'usage' link in the wiki does not return any information. It only offers the option to update the wiki, which does not make sense before having read anything.
Thanks!

Pip install problem

Currently I get this when trying to install crab:

error in scikits.crab setup command: Distribution contains no modules or packages for namespace package 'scikits'

My installation attempts included

pip install git+git://github.com/muricoca/crab.git

and

pip install -e git+git://github.com/muricoca/crab.git#egg=crab
