
spark-libfm's Issues

Issue found with cross-term evaluation

Hi Ruifeng,

I have run a dataset that I generated, which contains specific cross-term relationships, through your FM engine. With Rendle's libFM project I get the expected results; with your library, however, I get only noise, or rather false positives. Can you provide some usage tips, or would you be willing to run through the data set and look for possible errors somewhere in the code?

I am using the Gradient Descent optimization algorithm.

All individual weights come out as zero, and I have tried different values for the learning rate as well. However, I'm not actually concerned with the individual weights; I consider this a symptom of the underlying problem.

My concern is to locate outlier cross terms in the data. With your library I don't get a single expected cross term, but with libFM I get all the cross terms, plus only a handful of false positives.

Here's my expected cross-term list:
1,7
3,9
5,10
6,12
14,15
16,17
19,20

My method of finding the cross terms is as follows (a sketch of this procedure is given below):

1. Take the output matrix of the model, F.
2. Compute C = F * F_transpose.
3. Use C to look up the terms of interest by scanning it row by row: compute each row's mean and variance, assume a normal distribution, and look for the upper entries exceeding a threshold. If no terms are found, I gradually decrease the threshold until I either find some "outliers" or find none.
4. Examine the list of "outliers" for my cross terms of interest; I don't care about the order.
Let me re-state that the same method works with Rendle's libFM engine. I have also tried replicating the algorithm parameters I used in his library when running your code in Spark.
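
For concreteness, a minimal sketch of the procedure above in Scala, assuming the FM latent factors are available as a dense numFeatures x k Breeze matrix (the function name and the fixed per-row threshold rule are illustrative, not part of the library):

    import breeze.linalg.DenseMatrix

    // Flag pairs (i, j) whose interaction score C(i, j) = <v_i, v_j>
    // exceeds mean + threshold * stddev of row i of C = F * F_transpose.
    def crossTermOutliers(factors: DenseMatrix[Double],
                          threshold: Double): Seq[(Int, Int)] = {
      val c = factors * factors.t  // C = F * F_transpose
      (0 until c.rows).flatMap { i =>
        val row = c(i, ::).t.toArray
        val mean = row.sum / row.length
        val std = math.sqrt(row.map(x => (x - mean) * (x - mean)).sum / row.length)
        row.zipWithIndex.collect {
          case (v, j) if j != i && v > mean + threshold * std => (i, j)
        }
      }
    }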

Thank you,
Karl

Use enum instead of int

task is currently an int:

@param task 0 for Regression, and 1 for Binary Classification

Calling it looks like:

FMWithSGD.train(trainingData.rdd(), 1, numIterations);

Which is much less clear than something like:

FMWithSGD.train(trainingData.rdd(), REGRESSION, numIterations);

I would suggest using an enum instead of an int. This is exactly the type of case where enums are meant to be used.
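
A hedged sketch of what the suggested change could look like on the Scala side (these names are illustrative, not the library's current API):

    // Hypothetical replacement for the Int task flag.
    sealed trait FMTask
    case object Regression extends FMTask
    case object BinaryClassification extends FMTask

    // The call site would then document itself:
    //   FMWithSGD.train(trainingData.rdd(), Regression, numIterations)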

Typos

In FMWithSGD:
numFeaturs -> numFeatures
FMModle -> FMModel

FM model generates successfully, but why can't it be loaded for prediction?

Hello. I can generate and save an FM model, but why can't I load it back to predict?
The error is as follows:
Exception in thread "main" java.lang.Exception: FMModel.load did not recognize model with (className, format version): (org.apache.spark.mllib.classification.FMModel$SaveLoadV1_0$, 1.0).
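
For reference, a minimal usage sketch, assuming FMModel follows the standard spark.mllib Saveable/Loader pattern that the error message suggests (the calls below are assumptions, not verified against this library):

    // Assumed save/load round trip in the spark.mllib style; `sc` is an
    // existing SparkContext, and the path and `features` are placeholders.
    model.save(sc, "/tmp/fm-model")
    val loaded = FMModel.load(sc, "/tmp/fm-model")
    val score = loaded.predict(features)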

`FMWithSGD` default constructor parameters are inconsistent/too small

From the FMWithSGD file:

  /**
    * Construct an object with default parameters: {task: 0, stepSize: 1.0, numIterations: 100,
    * dim: (true, true, 8), regParam: (0, 0.01, 0.01), miniBatchFraction: 1.0}.
    */
  def this() = this(0, 1.0, 100, (true, true, 8), (0, 1e-3, 1e-4), 1e-5)

The comment is inconsistent with the actual values passed.

It is also worth noting that 1e-5 may be too small a batch fraction to train over all parameters. Since the GradientDescent implementation in Spark performs numIterations iterations of mini-batch SGD, each sampling a fraction miniBatchFraction of the data, roughly numIterations * miniBatchFraction of the labeled points are touched in total. For numIterations = 100 and miniBatchFraction = 1e-5, this means at most a fraction 1e-3 (0.1%) of the labeled points is actually used during training!

Further implications: since the model has a set of parameters per feature, any feature unseen during training simply keeps its initial values: a latent vector drawn from a normal distribution and a weight of 0.0.
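
One way to resolve the mismatch, assuming the docstring reflects the intended defaults (an assumption; the fix could equally go the other way), is to make the values match the comment:

      /**
        * Construct an object with default parameters: {task: 0, stepSize: 1.0, numIterations: 100,
        * dim: (true, true, 8), regParam: (0, 0.01, 0.01), miniBatchFraction: 1.0}.
        */
      def this() = this(0, 1.0, 100, (true, true, 8), (0, 0.01, 0.01), 1.0)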

Bad performance when faced with a large data set

I have a large data set (20 million rows) whose features are (user, topic) pairs, one-hot encoded into about 5,000 columns, and whose label is 0/1.
Logistic regression's AUC easily reaches 0.84, but FM's AUC is just around 0.5, or perhaps 0.46.

The parameters I used (I have tried both SGD and L-BFGS):

    val model = FMWithSGD.train(training, task = 1, numIterations = 100, stepSize = 0.15,
      miniBatchFraction = 1.0, dim = (true, true, 10), regParam = (0, 0, 0), initStd = 0.1)

And

    val model = FMWithLBFGS.train(training, task = 1, numIterations = 20,
      numCorrections = 5, dim = (true, true, 20), regParam = (0, 0, 0), initStd = 0.1)

Could you point out how I can get better performance with FM?

I got an error when running FMWithLBFGS.train

I got an error when running FMWithLBFGS.train. The message is:

19/03/18 16:02:48 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
19/03/18 16:02:48 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
19/03/18 16:02:58 INFO StrongWolfeLineSearch: Line search t: 0.11782555073452997 fval: 0.18211103495544895 rhs: 0.2606692174807194 cdd: 1.0498989669795253
19/03/18 16:02:58 INFO LBFGS: Step Size: 0.1178
19/03/18 16:02:58 INFO LBFGS: Val and Grad Norm: 0.182111 (rel: 0.302) 0.509790
19/03/18 16:03:02 INFO StrongWolfeLineSearch: Line search t: 0.36638861081271834 fval: 0.18211103495544895 rhs: 0.18211004643573764 cdd: -0.0115736658307402
19/03/18 16:03:05 INFO StrongWolfeLineSearch: Line search t: 0.09905059408456629 fval: 0.18211103495544895 rhs: 0.18211076771607038 cdd: -0.02229934335451164
19/03/18 16:03:06 INFO StrongWolfeLineSearch: Line search t: 0.02245692390442945 fval: 0.18211103495544895 rhs: 0.1821109743664697 cdd: -0.02588247744386086
19/03/18 16:03:08 INFO StrongWolfeLineSearch: Line search t: 0.004822954619910672 fval: 0.18211103495544895 rhs: 0.18211102194307474 cdd: -0.02674252026232241
19/03/18 16:03:10 INFO StrongWolfeLineSearch: Line search t: 0.0010227603171640213 fval: 0.18211103495544895 rhs: 0.18211103219603256 cdd: -0.026929624963460522
19/03/18 16:03:12 INFO StrongWolfeLineSearch: Line search t: 2.1629420637255935E-4 fval: 0.18211103495544895 rhs: 0.18211103437188528 cdd: -0.026969412562859015
19/03/18 16:03:15 INFO StrongWolfeLineSearch: Line search t: 4.5715477240673296E-5 fval: 0.18211103495544895 rhs: 0.1821110348321082 cdd: -0.02697783181839336
19/03/18 16:03:19 INFO StrongWolfeLineSearch: Line search t: 9.661135718329927E-6 fval: 0.18211103495544895 rhs: 0.18211103492938313 cdd: -0.026979611515011322
19/03/18 16:03:20 INFO StrongWolfeLineSearch: Line search t: 2.041652436213748E-6 fval: 0.18211103495544895 rhs: 0.18211103494994055 cdd: -0.026979987631432584
19/03/18 16:03:23 INFO StrongWolfeLineSearch: Line search t: 4.3145256182019533E-7 fval: 0.18211103495544895 rhs: 0.18211103495428488 cdd: -0.02698006711517943
19/03/18 16:03:23 ERROR LBFGS: Failure! Resetting history: breeze.optimize.FirstOrderException: Line search zoom failed
19/03/18 16:03:28 INFO StrongWolfeLineSearch: Line search t: 0.2313204034196764 fval: 0.18211103495544895 rhs: 0.18210502325583827 cdd: 0.15801565968824438
19/03/18 16:03:29 INFO StrongWolfeLineSearch: Line search t: 0.10214095643616886 fval: 0.18211103495544895 rhs: 0.1821083804522622 cdd: 0.047316513760600414
19/03/18 16:03:29 INFO StrongWolfeLineSearch: Line search t: 0.03727126336829138 fval: 0.18211103495544895 rhs: 0.18211006632649893 cdd: -0.11267143347262713
19/03/18 16:03:35 INFO StrongWolfeLineSearch: Line search t: 0.010054503272280822 fval: 0.18211103495544895 rhs: 0.18211077365271683 cdd: -0.2152298221473921
19/03/18 16:03:39 INFO StrongWolfeLineSearch: Line search t: 0.002278000948200439 fval: 0.18211103495544895 rhs: 0.1821109757533327 cdd: -0.2494210814426842
19/03/18 16:03:41 INFO StrongWolfeLineSearch: Line search t: 4.891533182012405E-4 fval: 0.18211103495544895 rhs: 0.18211102224302597 cdd: -0.2576215252050019
19/03/18 16:03:42 INFO StrongWolfeLineSearch: Line search t: 1.0372658433615846E-4 fval: 0.18211103495544895 rhs: 0.18211103225973746 cdd: -0.2594052144599783
19/03/18 16:03:49 INFO StrongWolfeLineSearch: Line search t: 2.1936016937586954E-5 fval: 0.18211103495544895 rhs: 0.182111034385362 cdd: -0.25978449808886667
19/03/18 16:03:51 INFO StrongWolfeLineSearch: Line search t: 4.636341753405551E-6 fval: 0.18211103495544895 rhs: 0.1821110348349568 cdd: -0.25986475569277256
19/03/18 16:03:52 INFO StrongWolfeLineSearch: Line search t: 9.798062777729317E-7 fval: 0.18211103495544895 rhs: 0.18211103492998512 cdd: -0.2598817208409102
19/03/18 16:03:52 ERROR LBFGS: Failure again! Giving up and returning. Maybe the objective is just poorly behaved?
19/03/18 16:03:52 INFO LBFGS: Converged because line search failed!

But there's no error when I use FMWithSGD.train with the same data set, and I don't know why. Thanks for your help!

Use 1/0 labels for binary classification instead of 1/-1

The loss function used in this library for binary classification is a hinge-loss function assuming labels +1 or -1:

case 1 =>
  1 - Math.signum(pred * label)

However, the predictions being made are in the range 0-1:

case 1 =>
  1.0 / (1.0 + Math.exp(-pred))

The 1/0 convention used in the predictions should be preferred over the 1/-1 convention expected by the loss function, because in spark.mllib the negative label is represented by 0 rather than -1, for consistency with multiclass labeling.

The loss function should be changed to match the way Spark does it.
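
For reference, a minimal sketch of a 0/1-label logistic loss in the style of spark.mllib's LogisticGradient (the function name is illustrative; `pred` is assumed to be the raw margin before the sigmoid, as in the prediction code quoted above):

    // Log loss for labels in {0.0, 1.0}; `pred` is the raw margin.
    def logLoss(pred: Double, label: Double): Double = {
      val margin = -pred
      if (label > 0.5) math.log1p(math.exp(margin))      // -log(sigmoid(pred))
      else math.log1p(math.exp(margin)) - margin         // -log(1 - sigmoid(pred))
    }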

Unused dependencies

None of the dependencies specified in the build file are actually used except for spark-mllib. It would be good to delete the others.
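
A sketch of the trimmed dependency list in sbt (the Spark version string is a placeholder, not taken from the repository):

    // build.sbt: keep only the dependency that is actually used.
    libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.4.1" % "provided"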

Got error while running spark-libFM with LBFGS

It always gets this error and fails.
What does this error mean? How can I solve it?

The error is as follows:


16/04/19 18:47:35 ERROR LBFGS: Failure! Resetting history: breeze.optimize.FirstOrderException: Line search zoom failed
16/04/19 18:49:18 ERROR LBFGS: Failure again! Giving up and returning. Maybe the objective is just poorly behaved?

Deprecated API calls

saveAsParquetFile should be replaced by write.parquet

And parquetFile should be replaced by read.parquet
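
A sketch of the migration, assuming an existing SQLContext `sqlContext` and DataFrame `df` (both placeholders), with a placeholder path:

    // Deprecated since Spark 1.4:
    //   df.saveAsParquetFile("/tmp/fm.parquet")
    //   val loaded = sqlContext.parquetFile("/tmp/fm.parquet")

    // Replacement:
    df.write.parquet("/tmp/fm.parquet")
    val loaded = sqlContext.read.parquet("/tmp/fm.parquet")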
