benfred / bens-blog-code

code snippets from my blog

Home Page: http://www.benfrederickson.com/blog

HTML 3.13% JavaScript 53.51% Python 40.80% Thrift 0.10% C++ 1.81% Rust 0.66%

bens-blog-code's Issues

Approximate ALS: Initialization time

This is a very nice blog post, thanks! From what I understand reading the code, the times here exclude the initialization time of the different ANN methods, right? I had the impression that NMSLIB's "weakness", so to speak, was that building the initial graphs could take quite a long time. Would your conclusions change if you included the initialization time on a very big dataset (millions of users and/or items)?

Question on BM25

Great blog posts! I have one point of confusion that I hope you don't mind me asking about here.

I have calculated BM25 distance for a query against documents to get a score for each query / document pair. Is that what you are doing with the BM25 weighting function? Like https://en.wikipedia.org/wiki/Okapi_BM25?

It appears that you then use the weighting inside a cosine distance, which confused me because I thought BM25 was itself a distance score?
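To make the distinction in this question concrete, here is a minimal sketch, assuming a small dense count matrix and the BM25 formula from the linked Wikipedia page. It is not the blog's code: the function name `bm25_weight`, the parameters `k1` and `b`, and the toy data are all illustrative assumptions. It shows BM25 used as a per-entry weight, with cosine similarity then computed on the weighted rows.

```python
import numpy as np


def bm25_weight(counts, k1=1.2, b=0.75):
    """Replace raw counts with BM25-style weights (rows = documents, columns = terms)."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    # Number of rows in which each column is non-zero.
    doc_freq = np.count_nonzero(counts, axis=0)
    # Smoothed inverse document frequency; always positive in this variant.
    idf = np.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalisation relative to the average row total.
    lengths = counts.sum(axis=1, keepdims=True)
    length_norm = 1.0 - b + b * lengths / lengths.mean()
    return counts * (k1 + 1.0) / (counts + k1 * length_norm) * idf


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Made-up counts, for illustration only.
counts = np.array([[10, 0, 3, 0],
                   [8, 1, 0, 0],
                   [0, 5, 0, 7]])
weighted = bm25_weight(counts)

print(cosine(weighted[0], weighted[1]))
print(cosine(weighted[0], weighted[2]))
```

In the query-scoring use of BM25, the same per-term weights are summed over the terms of a query. Here they only rescale the vectors before a separate similarity measure is applied.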

Reproducing the plots

Hi, I was trying to reproduce the plots and was confused a bit by the steps:

So I first did:

python create_ann_benchmarks_data.py

This created a file called lastfm50-10000--1-3.npz. After that, I am not sure what you mean by copying it to the 'queries' folder in your ann-benchmarks.

When I tried to ignore that and moved on to the next instruction:

# dataset is the path to the .npz file
python ann_benchmarks/main.py --dataset lastfm50-10000--1-3.npz --algo 'hnsw(nmslib)' --algo faiss --algo annoy

I get ModuleNotFoundError: No module named 'ann_benchmarks'. Some guidance would be much appreciated. Thanks.

Performance Hack

Hey Ben,

So I was perusing your blog entry/the code involved and I was struck by how you handle sparse matrices.

Your performance hack is quite clever and works for the situation given, but I have found a (somewhat trivial) fix that relies only on the core NumPy/SciPy linear algebra libraries.

Because of how the libraries' dot product functions handle multiplication of large sparse matrices (specifically for cosine similarity, the exact case I applied this technique to), there is a much more efficient method that yields the entire calculation.

The core issue is that these methods per the libraries' base definitions are C-Contiguous rather than F-Contiguous. Fortran is light years faster than C at any linalg process you can throw at it with the appropriate libraries.

In an applied, only dot-product system I utilized this method and wrote on it
https://medium.com/@_devbob/from-0-to-warp-speed-b780a2bc36ce

However, after I published this, Facebook published a much more robust method that uses Intel's Math Kernel Library to override the standard BLAS libraries. I cannot find the post now (it may have been removed for whatever reason), but I will search my bookmarks and see if I can provide it tomorrow sometime (it's quite late here).

Anywho, not a huge thing but it was quite literally a 1000x speedup on the operations w.r.t sparse matrix cosine similarity specifically.

Cheers,
Bobby
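For readers unfamiliar with the terms in this issue, the following is a small sketch of what C-contiguous and F-contiguous mean in NumPy, plus an all-pairs cosine similarity computed as a single matrix product. It does not reproduce the linked post and makes no timing claim. The helper `cosine_matrix` and the random data are assumptions for illustration, and any speed difference would depend on the BLAS build and the matrix shapes.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((5, 4))

# NumPy arrays are row-major (C order) by default.
print(x.flags["C_CONTIGUOUS"])

# Same values, stored column-major (Fortran order).
x_f = np.asfortranarray(x)
print(x_f.flags["F_CONTIGUOUS"])


def cosine_matrix(m):
    """All-pairs cosine similarity between the rows of m."""
    unit = m / np.linalg.norm(m, axis=1, keepdims=True)
    return unit @ unit.T


# The memory layout changes how the data is stored, not the result.
sim_c = cosine_matrix(x)
sim_f = cosine_matrix(x_f)
```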

Maximum inner product for Annoy

Just read http://www.benfrederickson.com/approximate-nearest-neighbours-for-recommender-systems/

How are you running Annoy for maximum inner product? Annoy doesn't support it out of the box, so I wonder if that's the reason it doesn't perform well
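One known way to run a maximum inner product search on an index that only supports cosine or Euclidean distance is to append one extra dimension so that every item vector has the same norm. The issue does not say whether the benchmark did this, so treat the following as a sketch under that assumption. A brute-force NumPy search stands in for Annoy, and the data is random.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random item vectors with deliberately different norms.
items = rng.normal(size=(200, 8)) * rng.uniform(0.5, 3.0, size=(200, 1))
query = rng.normal(size=8)

# Pad each item so that all of them share the norm of the largest item.
norms = np.linalg.norm(items, axis=1)
extra = np.sqrt(norms.max() ** 2 - norms ** 2)
items_aug = np.hstack([items, extra[:, None]])

# The query gets a zero in the extra dimension, so dot products are unchanged.
query_aug = np.append(query, 0.0)

# Exact maximum inner product on the original vectors.
best_ip = int(np.argmax(items @ query))

# Nearest neighbour by cosine similarity on the augmented vectors.
cos = (items_aug @ query_aug) / (
    np.linalg.norm(items_aug, axis=1) * np.linalg.norm(query_aug)
)
best_cos = int(np.argmax(cos))

print(best_ip, best_cos)
```

Because all augmented items have equal norm, ranking by cosine similarity gives the same order as ranking by inner product.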

truth value of a Series is ambiguous...

I'm trying to work through your blog post.

In the section "map each artist to a sparse vector of their users", I keep getting a "The truth value of a Series is ambiguous" error, though only after a long calculation.

Specifically, this line seems to be the offending one:
for artist, group in data.groupby('artist'):

Might there be some library you're loading which may have missed getting mentioned in the blog? Or perhaps a Python 2 vs Python 3 issue?
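As background for this question: pandas raises this error whenever a whole Series is used where a single True/False is expected. The sketch below uses a made-up miniature data frame, not the blog's data, and does not claim to identify the cause in the blog code. It shows that looping over a groupby works by itself and reproduces the error with a Series used as a condition.

```python
import pandas as pd

# Hypothetical miniature of the data, for illustration only.
data = pd.DataFrame({
    "user": ["u1", "u2", "u1", "u3"],
    "artist": ["a", "a", "b", "b"],
    "plays": [3, 5, 1, 2],
})

# Iterating over a groupby yields (key, sub-frame) pairs and is fine by itself.
users_per_artist = {
    artist: set(group["user"]) for artist, group in data.groupby("artist")
}

# The error appears when a whole Series is used as a single boolean.
raised = False
message = ""
try:
    if data["plays"] > 2:
        pass
except ValueError as err:
    raised = True
    message = str(err)

print(users_per_artist)
print(message)
```

Reducing the Series explicitly, for example with `.any()` or `.all()`, avoids the error when a single boolean is what is wanted.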

Use highest pickle protocol

http://docs.python.org/2/library/pickle.html

There are currently 3 different protocols which can be used for pickling.

Protocol version 0 is the original ASCII protocol and is backwards compatible
with earlier versions of Python. Protocol version 1 is the old binary format
which is also compatible with earlier versions of Python. Protocol version 2
was introduced in Python 2.3. It provides much more efficient pickling of
new-style classes.

Refer to PEP 307 for more information.

If a protocol is not specified, protocol 0 is used. If protocol is specified
as a negative value or HIGHEST_PROTOCOL, the highest protocol version
available will be used.

This benchmark uses the old, slow pickle protocol.
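A minimal sketch of the suggested change, using only the standard library. The payload is made up for illustration. Note that the text quoted above comes from the Python 2 documentation; Python 3 already defaults to a binary protocol, so there the explicit argument matters less.

```python
import pickle

# Hypothetical payload, chosen only for illustration.
payload = {"ids": list(range(1000)), "name": "example"}

# Protocol 0 is the original ASCII format.
ascii_bytes = pickle.dumps(payload, protocol=0)

# HIGHEST_PROTOCOL selects the newest binary format this interpreter supports.
binary_bytes = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)

print(len(ascii_bytes), len(binary_bytes))

# Both encodings round-trip to the same object.
restored_ascii = pickle.loads(ascii_bytes)
restored_binary = pickle.loads(binary_bytes)
```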

Pickle is fast for some cases

try this and see the output.
I would appreciate it if you could explain to me why the test case in the attached file shows pickle being faster and json being slower. [I do not understand what is happening here]. I merely stumbled onto this test case because of the use case in my project.

test.txt

Memory cost of nmslib and faiss comparison

Hi, I have read the amazing post "approximate-nearest-neighbours-for-recommender-systems", thanks for the great post! I have one question: when you ran the experiments, what was the memory cost of faiss and nmslib, and is there an obvious gap between them? Thank you : )

what is filename in your code?

Hi,

Thanks for the matrix factorization blog. I am trying to understand it. I think I should start by cleaning the last.fm files.

When I run your code below, where do I define or point "filename" to "usersha1-artmbid-artname-plays.tsv"?
It seems like a couple of lines on how to execute these functions are missing.

import pandas

class MusicData(object):
    def __init__(self, filename):
        # load TSV file from disk, keeping only the user, artist name and play count columns
        self.data = pandas.read_table(filename,
                                      usecols=[0, 2, 3],
                                      names=['user', 'artist', 'plays'])

Thank you so much.

Best regards,
Sagar

p.s. new to python. Also, I am not clear what you are doing here. thanks.
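For what it's worth, `filename` in the snippet above is an ordinary constructor argument, so the path is supplied when the object is created. Here is a minimal sketch, assuming the file has four tab-separated columns (user, artist id, artist name, plays), which is an inference from the file name and `usecols`. The two rows and the temporary file are made up so the example runs without the real dataset.

```python
import os
import tempfile

import pandas


class MusicData(object):
    def __init__(self, filename):
        # load TSV file from disk
        self.data = pandas.read_table(filename,
                                      usecols=[0, 2, 3],
                                      names=['user', 'artist', 'plays'])


# Two made-up rows in the assumed four-column layout.
rows = "user1\tmbid1\tartist one\t10\nuser2\tmbid2\tartist two\t4\n"
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as handle:
    handle.write(rows)
    path = handle.name

# The path is passed here, when the object is created.
music = MusicData(path)
os.remove(path)

print(music.data)
```

With the real dataset, the call would be `MusicData("usersha1-artmbid-artname-plays.tsv")`, run from the directory that contains the file.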
