bens-blog-code's People

Contributors

benfred, sethwoodworth

bens-blog-code's Issues

Approximate ALS: Initialization time

This is a very nice blog post, thanks! From what I understand from reading the code, the times here exclude the initialization time of the different ANN methods, right? I had the impression that NMSLIB's "weakness", so to speak, was that building the initial graphs could be quite lengthy. Would your conclusions change if you included the initialization time on a very big dataset (millions of users and/or items)?

Question on BM25

Great blog posts! I have one point of confusion that I hope you don't mind me asking about here.

I have calculated BM25 distance for a query against documents to get a score for each query / document pair. Is that what you are doing with the BM25 weighting function? Like https://en.wikipedia.org/wiki/Okapi_BM25?

It appears you then use the weighting inside a cosine distance, which confused me because I thought BM25 was itself a distance score?
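To make the two uses of BM25 in this question concrete, here is a minimal pure-Python sketch. It is not the blog's implementation: the `k1` and `b` values, the IDF variant (the one given on the linked Wikipedia page), the helper names, and the toy documents are all assumptions for illustration. It treats BM25 as a per-term weighting, then shows that summing the weights of the query terms gives the classic query/document score, while taking the cosine between two weighted vectors gives a document/document similarity.

```python
import math


def bm25_weights(docs, k1=1.2, b=0.75):
    """Return one {term: weight} dict per document using BM25 term weighting."""
    n_docs = len(docs)
    avg_len = sum(sum(d.values()) for d in docs) / n_docs

    # Document frequency: how many documents contain each term.
    df = {}
    for d in docs:
        for term in d:
            df[term] = df.get(term, 0) + 1

    weighted = []
    for d in docs:
        # Longer-than-average documents get their term counts damped.
        length_norm = (1.0 - b) + b * sum(d.values()) / avg_len
        weights = {}
        for term, tf in d.items():
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            weights[term] = idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)
        weighted.append(weights)
    return weighted


def okapi_score(query_terms, weighted_doc):
    """Classic BM25 use: score one document against a list of query terms."""
    return sum(weighted_doc.get(term, 0.0) for term in query_terms)


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy term-count vectors (made-up data).
docs = [{"a": 3, "b": 1}, {"a": 1, "c": 4}, {"b": 2, "c": 2}]
weighted = bm25_weights(docs)

query_scores = [okapi_score(["a"], w) for w in weighted]  # query vs. document
doc_similarity = cosine(weighted[0], weighted[1])         # document vs. document
```

In this reading, the weighting step and the similarity measure are separate choices: the same BM25-weighted vectors can feed either a query score or a cosine comparison.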

Reproducing the plots

Hi, I was trying to reproduce the plots and was confused a bit by the steps:

So I first did:

python create_ann_benchmarks_data.py

This created a file called lastfm50-10000--1-3.npz. After that, I am not sure what you mean by copying it to the 'queries' folder in your ann-benchmarks.

When I tried to ignore that and moved on to the next instruction:

# dataset is the path to the .npz file
python ann_benchmarks/main.py --dataset lastfm50-10000--1-3.npz --algo 'hnsw(nmslib)' --algo faiss --algo annoy

I get ModuleNotFoundError: No module named 'ann_benchmarks'. Some guidance would be much appreciated. Thanks
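One general cause of this kind of error, independent of ann-benchmarks itself: when Python runs a script by path, it puts the script's own directory on sys.path, not the directory that contains the package. The sketch below reproduces that with a throwaway package (`demo_pkg_sketch` is a made-up name, not part of ann-benchmarks). Whether this is what happened here is an assumption; the error would also appear if the ann-benchmarks checkout were missing altogether.

```python
import os
import subprocess
import sys
import tempfile


def run_script(root, extra_env=None):
    """Run demo_pkg_sketch/main.py from `root`; return (exit code, stderr)."""
    env = dict(os.environ)
    env.pop("PYTHONPATH", None)  # start from a clean import path
    env.update(extra_env or {})
    proc = subprocess.run(
        [sys.executable, os.path.join("demo_pkg_sketch", "main.py")],
        cwd=root, env=env, capture_output=True, text=True,
    )
    return proc.returncode, proc.stderr


with tempfile.TemporaryDirectory() as root:
    pkg_dir = os.path.join(root, "demo_pkg_sketch")
    os.mkdir(pkg_dir)
    with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
        f.write("")
    # The script imports its own package by name, as a package-style main.py would.
    with open(os.path.join(pkg_dir, "main.py"), "w") as f:
        f.write("import demo_pkg_sketch\n")

    # Fails: sys.path[0] is the package directory, so the package name is not found.
    failed = run_script(root)

    # Works: the directory that contains the package is on the import path.
    fixed = run_script(root, {"PYTHONPATH": root})
```

Under that assumption, running the command from the directory that contains the package with that directory on PYTHONPATH is one way to make the import resolve.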

Performance Hack

Hey Ben,

So I was perusing your blog entry/the code involved and I was struck by how you handle sparse matrices.

Your performance hack is quite clever and works for the situation given, but I have personally found a (somewhat trivial) fix that relies on the core NumPy/SciPy linear algebra libraries.

Because of how the libraries' dot-product functions handle sparse matrix multiplication on large matrices (specifically for cosine similarity, the exact case I applied the following technique to), there is a much more efficient method that still yields the entire calculation.

The core issue is that, by the libraries' base definitions, these methods are C-contiguous rather than F-contiguous. Fortran is light years faster than C at any linalg process you can throw at it with the appropriate libraries.

I used this method in an applied, dot-product-only system and wrote about it:
https://medium.com/@_devbob/from-0-to-warp-speed-b780a2bc36ce

However, after I published this, Facebook polished a much more robust method that uses Intel's entire Math Kernel Library to override the standard BLAS libraries. I cannot find the post now (it may have been removed for whatever reason), but I will search my bookmarks and see if I can provide it to you sometime tomorrow (it's quite late here).

Anywho, not a huge thing but it was quite literally a 1000x speedup on the operations w.r.t sparse matrix cosine similarity specifically.

Cheers,
Bobby

truth value of a Series is ambiguous...

I am trying to work through your blog post.

In the section "map each artist to a sparse vector of their users", I keep getting a "The truth value of a Series is ambiguous" error, though only after a long calculation.

Specifically, this snippet of code:
for artist, group in data.groupby('artist')
seems to be the offending line

Might there be some library you're loading which may have missed getting mentioned in the blog? Or perhaps a Python 2 vs Python 3 issue?

Use highest pickle protocol

http://docs.python.org/2/library/pickle.html

There are currently 3 different protocols which can be used for pickling.

Protocol version 0 is the original ASCII protocol and is backwards compatible
with earlier versions of Python. Protocol version 1 is the old binary format
which is also compatible with earlier versions of Python. Protocol version 2
was introduced in Python 2.3. It provides much more efficient pickling of
new-style classes.

Refer to PEP 307 for more information.

If a protocol is not specified, protocol 0 is used. If protocol is specified
as a negative value or HIGHEST_PROTOCOL, the highest protocol version
available will be used.

This benchmark uses the old, slow pickle protocol.
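A small sketch of the difference, using only the standard library. The payload is made up, and `protocol=0` is passed explicitly because Python 3 defaults to a newer binary protocol, unlike the Python 2 behaviour quoted above.

```python
import pickle

# Hypothetical payload, only for comparing encodings.
data = {"artist": "example", "plays": list(range(1000))}

# Protocol 0: the original ASCII format.
ascii_bytes = pickle.dumps(data, protocol=0)

# Highest protocol available in the running interpreter.
binary_bytes = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

# A negative protocol value selects the highest protocol as well.
negative_bytes = pickle.dumps(data, protocol=-1)

print(len(ascii_bytes), len(binary_bytes))
```

Both encodings load back to the same object with `pickle.loads`; the binary one is more compact for this payload.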

Pickle is fast for some cases

Try this and see the output.
I would appreciate it if you could explain why the test case in the attached file shows pickle being faster and json slower. (I do not understand what is happening here.) I only stumbled onto this test case because of the use case in my project.

test.txt

Memory cost of nmslib and faiss comparison

Hi, I have read the amazing post "approximate-nearest-neighbours-for-recommender-systems", thanks for the great post! I have one question: when you ran the experiments, how did the memory cost of faiss and nmslib compare? Is there an obvious gap between them? Thank you :)

what is filename in your code?

Hi,

Thanks for the matrix factorization blog post. I am trying to understand it. I should be starting with cleaning the last.fm files. When I run your code:

where do I define or point out that "filename" in your code below is "usersha1-artmbid-artname-plays.tsv"?
It seems like a couple of lines on how to execute these functions are missing.

import pandas

class MusicData(object):
    def __init__(self, filename):
        # load TSV file from disk, keeping the user, artist name and play count columns
        self.data = pandas.read_table(filename,
                                      usecols=[0, 2, 3],
                                      names=['user', 'artist', 'plays'])

Thank you so much.

Best regards,
Sagar

p.s. new to python. Also, I am not clear what you are doing here. thanks.
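As a general Python note on the question above: `filename` is a constructor argument, so it is supplied when the class is instantiated rather than defined inside the class. The sketch below uses a hypothetical stand-in class (`MusicDataSketch`, built on the standard `csv` module instead of pandas so it runs without third-party packages) and a made-up two-row TSV file to show the call pattern.

```python
import csv
import os
import tempfile


class MusicDataSketch(object):
    """Hypothetical stand-in that mirrors the constructor signature quoted above."""

    def __init__(self, filename):
        # Keep columns 0, 2 and 3: user, artist name and play count.
        with open(filename, newline="") as f:
            self.data = [(row[0], row[2], int(row[3]))
                         for row in csv.reader(f, delimiter="\t")]


with tempfile.TemporaryDirectory() as tmp:
    # Made-up sample rows: user, artist id, artist name, plays.
    path = os.path.join(tmp, "sample-plays.tsv")
    with open(path, "w", newline="") as f:
        f.write("user1\tmbid1\tartist a\t10\n")
        f.write("user2\tmbid2\tartist b\t3\n")

    # The filename is passed here, when the object is created.
    music = MusicDataSketch(path)
```

With the class from the post, the equivalent call would presumably be `MusicData("usersha1-artmbid-artname-plays.tsv")`, assuming that file sits in the working directory.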
