
spark-libfm's Issues

Issue found with cross-term evaluation

Hi Ruifeng,

I have run a dataset that I generated, which contains specific cross-term relationships, through your FM engine. With Rendle's libFM project I get the expected results; with your library, however, I get only noise, or rather false positives. Can you provide some usage tips, or would you be willing to run through the data set and look for possible errors somewhere in the code?

I am using the Gradient Descent optimization algorithm.

All individual weights come out as zero, and I have tried different values for the learning rate as well. However, I'm not actually concerned with the individual weights; I consider this a symptom of the underlying problem.

My concern is to locate outlier cross terms in the data. With your library I don't get a single expected cross term, but with libFM I get all the cross terms, plus only a handful of false positives.

Here's my expected cross-term list:
1,7
3,9
5,10
6,12
14,15
16,17
19,20

My method of finding the cross terms is as follows (a sketch of this procedure is given below):

1. Take the output matrix of the model, F.
2. Compute C = F * F_transpose.
3. Use C to look up the terms of interest by scanning it row by row: compute each row's mean and variance, assume a normal distribution, and look for the upper entries exceeding a threshold. If no terms are found, I gradually decrease the threshold until I either find some "outliers" or find none.
4. Examine the list of "outliers" for my cross terms of interest; I don't care about the order.
Let me re-state that the same method works with Rendle's libFM engine. I have also tried replicating the algorithm parameters I used in his library when running your code in Spark.
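
For concreteness, a minimal sketch of the procedure above in Scala, assuming the FM latent factors are available as a dense numFeatures x k Breeze matrix (the function name and the fixed per-row threshold rule are illustrative, not part of the library):

    import breeze.linalg.DenseMatrix

    // Flag pairs (i, j) whose interaction score C(i, j) = <v_i, v_j>
    // exceeds mean + threshold * stddev of row i of C = F * F_transpose.
    def crossTermOutliers(factors: DenseMatrix[Double],
                          threshold: Double): Seq[(Int, Int)] = {
      val c = factors * factors.t  // C = F * F_transpose
      (0 until c.rows).flatMap { i =>
        val row = c(i, ::).t.toArray
        val mean = row.sum / row.length
        val std = math.sqrt(row.map(x => (x - mean) * (x - mean)).sum / row.length)
        row.zipWithIndex.collect {
          case (v, j) if j != i && v > mean + threshold * std => (i, j)
        }
      }
    }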

Thank you,
Karl

Use enum instead of int

task is currently an int:

@param task 0 for Regression, and 1 for Binary Classification

Calling it looks like:

FMWithSGD.train(trainingData.rdd(), 1, numIterations);

Which is much less clear than something like:

FMWithSGD.train(trainingData.rdd(), REGRESSION, numIterations);

I would suggest using an enum instead of an int. This is exactly the type of case where enums are meant to be used.
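
A hedged sketch of what the suggested change could look like on the Scala side (these names are illustrative, not the library's current API):

    // Hypothetical replacement for the Int task flag.
    sealed trait FMTask
    case object Regression extends FMTask
    case object BinaryClassification extends FMTask

    // The call site would then document itself:
    //   FMWithSGD.train(trainingData.rdd(), Regression, numIterations)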

Typos

In FMWithSGD:
numFeaturs -> numFeatures
FMModle -> FMModel

FM model generates successfully, but why can't it be loaded for prediction?

Hello. I can generate and save an FM model, but why can't I load it back to predict?
The error is as follows:
Exception in thread "main" java.lang.Exception: FMModel.load did not recognize model with (className, format version): (org.apache.spark.mllib.classification.FMModel$SaveLoadV1_0$, 1.0).
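
For reference, a minimal usage sketch, assuming FMModel follows the standard spark.mllib Saveable/Loader pattern that the error message suggests (the calls below are assumptions, not verified against this library):

    // Assumed save/load round trip in the spark.mllib style; `sc` is an
    // existing SparkContext, and the path and `features` are placeholders.
    model.save(sc, "/tmp/fm-model")
    val loaded = FMModel.load(sc, "/tmp/fm-model")
    val score = loaded.predict(features)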

`FMWithSGD` default constructor parameters are inconsistent/too small

From the FMWithSGD file:

  /**
    * Construct an object with default parameters: {task: 0, stepSize: 1.0, numIterations: 100,
    * dim: (true, true, 8), regParam: (0, 0.01, 0.01), miniBatchFraction: 1.0}.
    */
  def this() = this(0, 1.0, 100, (true, true, 8), (0, 1e-3, 1e-4), 1e-5)

The comment is inconsistent with the actual values passed.

It is also worth noting that 1e-5 may be too small a batch fraction to train over all parameters. Since the GradientDescent implementation in Spark performs numIterations iterations of mini-batch SGD, each sampling a fraction miniBatchFraction of the data, roughly numIterations * miniBatchFraction of the labeled points are touched in total. For numIterations = 100 and miniBatchFraction = 1e-5, this means at most a fraction 1e-3 (0.1%) of the labeled points is actually used during training!

Further implications: since the model has a set of parameters per feature, any feature unseen during training simply keeps its initial values: a latent vector drawn from a normal distribution and a weight of 0.0.
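
One way to resolve the mismatch, assuming the docstring reflects the intended defaults (an assumption; the fix could equally go the other way), is to make the values match the comment:

      /**
        * Construct an object with default parameters: {task: 0, stepSize: 1.0, numIterations: 100,
        * dim: (true, true, 8), regParam: (0, 0.01, 0.01), miniBatchFraction: 1.0}.
        */
      def this() = this(0, 1.0, 100, (true, true, 8), (0, 0.01, 0.01), 1.0)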

Bad performance when faced with a large data set

I have a large data set (20 million rows) whose features are (user, topic) pairs, one-hot encoded into about 5,000 columns, and whose label is 0/1.
Logistic regression's AUC easily reaches 0.84, but FM's AUC is just around 0.5, or perhaps 0.46.

The parameters I used (I have tried both SGD and L-BFGS):

    val model = FMWithSGD.train(training, task = 1, numIterations = 100, stepSize = 0.15,
      miniBatchFraction = 1.0, dim = (true, true, 10), regParam = (0, 0, 0), initStd = 0.1)

And

    val model = FMWithLBFGS.train(training, task = 1, numIterations = 20,
      numCorrections = 5, dim = (true, true, 20), regParam = (0, 0, 0), initStd = 0.1)

Could you point out how I can get better performance with FM?

I got an error when running FMWithLBFGS.train

I got an error when running FMWithLBFGS.train. The message is:

19/03/18 16:02:48 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
19/03/18 16:02:48 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
19/03/18 16:02:58 INFO StrongWolfeLineSearch: Line search t: 0.11782555073452997 fval: 0.18211103495544895 rhs: 0.2606692174807194 cdd: 1.0498989669795253
19/03/18 16:02:58 INFO LBFGS: Step Size: 0.1178
19/03/18 16:02:58 INFO LBFGS: Val and Grad Norm: 0.182111 (rel: 0.302) 0.509790
19/03/18 16:03:02 INFO StrongWolfeLineSearch: Line search t: 0.36638861081271834 fval: 0.18211103495544895 rhs: 0.18211004643573764 cdd: -0.0115736658307402
19/03/18 16:03:05 INFO StrongWolfeLineSearch: Line search t: 0.09905059408456629 fval: 0.18211103495544895 rhs: 0.18211076771607038 cdd: -0.02229934335451164
19/03/18 16:03:06 INFO StrongWolfeLineSearch: Line search t: 0.02245692390442945 fval: 0.18211103495544895 rhs: 0.1821109743664697 cdd: -0.02588247744386086
19/03/18 16:03:08 INFO StrongWolfeLineSearch: Line search t: 0.004822954619910672 fval: 0.18211103495544895 rhs: 0.18211102194307474 cdd: -0.02674252026232241
19/03/18 16:03:10 INFO StrongWolfeLineSearch: Line search t: 0.0010227603171640213 fval: 0.18211103495544895 rhs: 0.18211103219603256 cdd: -0.026929624963460522
19/03/18 16:03:12 INFO StrongWolfeLineSearch: Line search t: 2.1629420637255935E-4 fval: 0.18211103495544895 rhs: 0.18211103437188528 cdd: -0.026969412562859015
19/03/18 16:03:15 INFO StrongWolfeLineSearch: Line search t: 4.5715477240673296E-5 fval: 0.18211103495544895 rhs: 0.1821110348321082 cdd: -0.02697783181839336
19/03/18 16:03:19 INFO StrongWolfeLineSearch: Line search t: 9.661135718329927E-6 fval: 0.18211103495544895 rhs: 0.18211103492938313 cdd: -0.026979611515011322
19/03/18 16:03:20 INFO StrongWolfeLineSearch: Line search t: 2.041652436213748E-6 fval: 0.18211103495544895 rhs: 0.18211103494994055 cdd: -0.026979987631432584
19/03/18 16:03:23 INFO StrongWolfeLineSearch: Line search t: 4.3145256182019533E-7 fval: 0.18211103495544895 rhs: 0.18211103495428488 cdd: -0.02698006711517943
19/03/18 16:03:23 ERROR LBFGS: Failure! Resetting history: breeze.optimize.FirstOrderException: Line search zoom failed
19/03/18 16:03:28 INFO StrongWolfeLineSearch: Line search t: 0.2313204034196764 fval: 0.18211103495544895 rhs: 0.18210502325583827 cdd: 0.15801565968824438
19/03/18 16:03:29 INFO StrongWolfeLineSearch: Line search t: 0.10214095643616886 fval: 0.18211103495544895 rhs: 0.1821083804522622 cdd: 0.047316513760600414
19/03/18 16:03:29 INFO StrongWolfeLineSearch: Line search t: 0.03727126336829138 fval: 0.18211103495544895 rhs: 0.18211006632649893 cdd: -0.11267143347262713
19/03/18 16:03:35 INFO StrongWolfeLineSearch: Line search t: 0.010054503272280822 fval: 0.18211103495544895 rhs: 0.18211077365271683 cdd: -0.2152298221473921
19/03/18 16:03:39 INFO StrongWolfeLineSearch: Line search t: 0.002278000948200439 fval: 0.18211103495544895 rhs: 0.1821109757533327 cdd: -0.2494210814426842
19/03/18 16:03:41 INFO StrongWolfeLineSearch: Line search t: 4.891533182012405E-4 fval: 0.18211103495544895 rhs: 0.18211102224302597 cdd: -0.2576215252050019
19/03/18 16:03:42 INFO StrongWolfeLineSearch: Line search t: 1.0372658433615846E-4 fval: 0.18211103495544895 rhs: 0.18211103225973746 cdd: -0.2594052144599783
19/03/18 16:03:49 INFO StrongWolfeLineSearch: Line search t: 2.1936016937586954E-5 fval: 0.18211103495544895 rhs: 0.182111034385362 cdd: -0.25978449808886667
19/03/18 16:03:51 INFO StrongWolfeLineSearch: Line search t: 4.636341753405551E-6 fval: 0.18211103495544895 rhs: 0.1821110348349568 cdd: -0.25986475569277256
19/03/18 16:03:52 INFO StrongWolfeLineSearch: Line search t: 9.798062777729317E-7 fval: 0.18211103495544895 rhs: 0.18211103492998512 cdd: -0.2598817208409102
19/03/18 16:03:52 ERROR LBFGS: Failure again! Giving up and returning. Maybe the objective is just poorly behaved?
19/03/18 16:03:52 INFO LBFGS: Converged because line search failed!

But there's no error when I use FMWithSGD.train with the same data set, and I don't know why. Thanks for your help!

Use 1/0 labels for binary classification instead of 1/-1

The loss function used in this library for binary classification is a hinge-loss function assuming labels +1 or -1:

case 1 =>
  1 - Math.signum(pred * label)

However, the predictions being made are in the range 0-1:

case 1 =>
  1.0 / (1.0 + Math.exp(-pred))

The 1/0 convention used in the predictions should be preferred over the 1/-1 convention expected by the loss function, because in spark.mllib the negative label is represented by 0 rather than -1, for consistency with multiclass labeling.

The loss function should be changed to match the way Spark does it.
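
For reference, a minimal sketch of a 0/1-label logistic loss in the style of spark.mllib's LogisticGradient (the function name is illustrative; `pred` is assumed to be the raw margin before the sigmoid, as in the prediction code quoted above):

    // Log loss for labels in {0.0, 1.0}; `pred` is the raw margin.
    def logLoss(pred: Double, label: Double): Double = {
      val margin = -pred
      if (label > 0.5) math.log1p(math.exp(margin))      // -log(sigmoid(pred))
      else math.log1p(math.exp(margin)) - margin         // -log(1 - sigmoid(pred))
    }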

Unused dependencies

None of the dependencies specified in the build file are actually used except for spark-mllib. It would be good to delete the others.
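
A sketch of the trimmed dependency list in sbt (the Spark version string is a placeholder, not taken from the repository):

    // build.sbt: keep only the dependency that is actually used.
    libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.4.1" % "provided"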

Got error while running spark-libFM with LBFGS

It always gets this error and fails.
What does this error mean? How can I solve it?

The error is as follows:


16/04/19 18:47:35 ERROR LBFGS: Failure! Resetting history: breeze.optimize.FirstOrderException: Line search zoom failed
16/04/19 18:49:18 ERROR LBFGS: Failure again! Giving up and returning. Maybe the objective is just poorly behaved?

Deprecated API calls

saveAsParquetFile should be replaced by write.parquet

And parquetFile should be replaced by read.parquet
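
A sketch of the migration, assuming an existing SQLContext `sqlContext` and DataFrame `df` (both placeholders), with a placeholder path:

    // Deprecated since Spark 1.4:
    //   df.saveAsParquetFile("/tmp/fm.parquet")
    //   val loaded = sqlContext.parquetFile("/tmp/fm.parquet")

    // Replacement:
    df.write.parquet("/tmp/fm.parquet")
    val loaded = sqlContext.read.parquet("/tmp/fm.parquet")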
