Comments (11)
libFM uses a time dependent seed for the random initialization by default.
"seed", "integer value, default=None"
https://github.com/srendle/libfm/blob/master/src/libfm/libfm.cpp#L93
I think the results between runs should match if you set a seed.
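Given the -seed option quoted above, fixing it should make two runs reproduce each other exactly. A sketch with placeholder file names (not taken from this thread):

```shell
# Two runs with the same seed should print identical per-iteration results.
./libFM -task c -train train.libfm -test test.libfm -method mcmc -seed 42
./libFM -task c -train train.libfm -test test.libfm -method mcmc -seed 42
```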
from libfm.
Using the same seed indeed prevents differences between runs. But what I am trying to report here is that the per-iteration training set and test set 'performance' differ, although I supplied the same data for both sets. I.e. in the snippet above, the train performance for iteration 99 is 0.52756, while the test performance on the same data is 0.530803. If I understand correctly, these numbers should be equal since the input data is equal.
This is based on my assumption that both numbers are produced by computing some performance metric (like the fraction correctly classified) on the predictions of the model (with parameters from that iteration), using either the training set or the test set as input. But that assumption might be wrong.
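The assumed metric can be sketched in a few lines of Python (the `accuracy` function below is illustrative, not libFM's actual code):

```python
def accuracy(labels, predictions, threshold=0.5):
    """Fraction of predictions that match the binary labels."""
    correct = sum(
        (p >= threshold) == (y == 1)
        for y, p in zip(labels, predictions)
    )
    return correct / len(labels)

# With identical input data and identical model parameters, such a
# metric would give the same score for the train and test set.
labels = [0, 1, 0, 1]
preds = [0.2, 0.8, 0.6, 0.9]
print(accuracy(labels, preds))  # 0.75: the 0.6 prediction for label 0 is wrong
```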
from libfm.
Can you check if this is also true with the option --method=ALS?
from libfm.
Yes. With libFM -task c -train train.libfm -test train.libfm -method als
there still is a small difference between the train and test scores.
from libfm.
How small is the difference compared to the difference with MCMC? Is it plausible that it's just a small numerical error? Which error is correct (train or test)? You can take the last reported error and compare it against what you get when calculating the error yourself.
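"Calculating the error yourself" could look like the sketch below: compare the predictions libFM writes with --out (one value per line) against the targets in the first column of the libSVM-style input file. File names and the helper are placeholders, not part of libFM:

```python
def manual_accuracy(data_path, pred_path, threshold=0.5):
    """Accuracy of a predictions file against a libSVM-style data file."""
    with open(data_path) as f:
        labels = [int(line.split()[0]) for line in f]
    with open(pred_path) as f:
        preds = [float(line) for line in f]
    correct = sum((p >= threshold) == (y == 1)
                  for y, p in zip(labels, preds))
    return correct / len(labels)

# Tiny stand-in files; in practice use train.libfm and the --out file.
with open('toy.libfm', 'w') as f:
    f.write('0 0:0.1\n1 0:0.9\n1 0:0.4\n')
with open('toy.pred', 'w') as f:
    f.write('0.2\n0.8\n0.3\n')
print(manual_accuracy('toy.libfm', 'toy.pred'))  # 0.6666666666666666
```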
from libfm.
I generated some artificial data with this Python script:
import random

with open('train.libfm', 'w') as f:
    for i in range(1000):
        # Write the alternating class label.
        if i % 2 == 0:
            f.write('0')
        else:
            f.write('1')
        # Write 100 dense Gaussian features.
        for j in range(100):
            f.write(' %d:%f' % (j, random.normalvariate(0, 1)))
        f.write('\n')
It generates alternating target labels, with 100 dense random features. The output looks like this:
...
#Iter= 97 Train=0.925 Test=0.997 Test(ll)=0.0801822
#Iter= 98 Train=0.913 Test=0.997 Test(ll)=0.0798717
#Iter= 99 Train=0.919 Test=0.997 Test(ll)=0.079558
It seems that it is overfitting, because the features are not informative. The difference is now relatively big. I have saved the output with the --out flag, and the results reported for Test= correspond to the accuracy calculated manually, so that part seems right. What could have caused the Train= score to deviate so much?
from libfm.
I think that the test score is calculated here: https://github.com/srendle/libfm/blob/master/src/libfm/src/fm_learn_mcmc_simultaneous.h#L243, while the train score is mainly calculated here: https://github.com/srendle/libfm/blob/master/src/libfm/src/fm_learn_mcmc_simultaneous.h#L170-L172. The code paths indeed seem different. So, what happens in the code path that computes the accuracy for the training set?
from libfm.
libFM uses a few tricks, like clipping predictions to the highest/lowest values. Maybe one of these tricks is only applied to the test predictions.
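The clipping trick mentioned here can be sketched as follows (a guess at the idea, not libFM's actual code; the bounds would come from the training targets):

```python
def clip_prediction(p, y_min=0.0, y_max=1.0):
    """Bound a raw prediction to the target range seen in training."""
    return max(y_min, min(y_max, p))

print(clip_prediction(1.3))   # 1.0
print(clip_prediction(-0.2))  # 0.0
print(clip_prediction(0.6))   # 0.6
```

If such clipping ran only on the test-side code path, the same raw predictions could yield slightly different train and test scores.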
from libfm.
The printed train accuracy is calculated for one MCMC draw. The test accuracy is computed over all draws (i.e., it is an average). I agree that this is misleading and both measures should report either the average or one draw.
In general, I would recommend looking at the log file and not at stdout. The log file is more verbose and reports all test values: one draw, all draws, all but 5 draws. It contains the log-likelihood and accuracy for these measures.
from libfm.
Thanks for the elaboration. I'll take a look at the log file to see if I understand it.
from libfm.
Where can I download the train and test data? I can only find movie, rating, user and tag data on MovieLens.
from libfm.