I've been having a hard time reproducing the reported results. Below is what I've tried.
- I reorganized the code in the way I'm used to and ran experiments on CASME_sq using features I extracted myself, following the instructions. The overall F1-score came out around 0.23, so I suspected something was wrong with my feature-extraction procedure and switched to the preprocessed features provided in the repo.
- Ran experiments on CASME_sq using the features provided in the repo.
Results:
Final result: TP:101, FP:290, FN:256
Precision = 0.2583
Recall = 0.185
F1-Score = 0.2156
The results are still not good, so I finally tried running the Jupyter notebook provided in #3.
- Ran experiments on CASME_sq & SAMMLV using the notebook & features provided in the repo. Here are the results.
Reproduction ipynb:
CASME:
Micro result: TP:3 FP:137 FN:54 F1_score:0.0305
Macro result: TP:100 FP:206 FN:200 F1_score:0.3300
Overall result: TP:103 FP:343 FN:254 F1_score:0.2565
SAMMLV:
Cumulative result until subject 30:
Micro result: TP:10 FP:169 FN:149 F1_score:0.0592
Macro result: TP:97 FP:277 FN:246 F1_score:0.2706
Overall result: TP:107 FP:446 FN:395 F1_score:0.2028
Orig ipynb:
CASME:
Micro result: TP:5 FP:77 FN:52 F1_score:0.0719
Macro result: TP:108 FP:166 FN:192 F1_score:0.3763
Overall result: TP:113 FP:243 FN:244 F1_score:0.3170
SAMM:
Micro result: TP:12 FP:104 FN:147 F1_score:0.0873
Macro result: TP:88 FP:198 FN:255 F1_score:0.2798
Overall result: TP:100 FP:302 FN:402 F1_score:0.2212
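For reference, the Micro/Macro/Overall F1 scores in the notebook outputs above all follow the standard F1 = 2*TP / (2*TP + FP + FN); a minimal check:

```python
# Quick sanity check of the notebook F1 scores from the raw TP/FP/FN counts.
def f1_score(tp: int, fp: int, fn: int) -> float:
    return 2 * tp / (2 * tp + fp + fn)

# e.g. the CASME_sq overall result from my reproduction run:
print(f"{f1_score(103, 343, 254):.4f}")  # 0.2565
# and from the original notebook:
print(f"{f1_score(113, 243, 244):.4f}")  # 0.3170
```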
As reported above, there is a large gap between the reproduction result and the original performance on CASME_sq, while the gap on SAMMLV is much smaller.
I've also tried fixing the random seed to 1 (seed pinning sketched below), but the result does not improve; replacing the mix of hard & soft label loss with a pure hard-label loss does improve the results. Moreover, I noticed many subtle differences between the original code and the Jupyter notebook. Using the spotting method from the original code produces very poor results:
Final result: TP:53, FP:320, FN:304
Precision = 0.1421
Recall = 0.0849
F1-Score = 0.1063
Replacing it with the spotting method from the Jupyter notebook works better, with results:
Final result: TP:102, FP:299, FN:255
Precision = 0.2544
Recall = 0.1841
F1-Score = 0.2136
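For completeness, this is what I mean by fixing the random seed; a minimal sketch assuming a PyTorch pipeline (adapt the calls if the repo trains with another framework):

```python
# Pin every RNG before training so runs are repeatable (sketch; assumes PyTorch).
import random

import numpy as np
import torch

def set_seed(seed: int = 1) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trades some speed for run-to-run reproducibility on GPU.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(1)
```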
I also found a typo in the original code:
```python
if end-start > macro_min and end-start < macro_max and (score_plot_micro[peak] > 0.95 or (score_plot_macro[peak] > score_plot_macro[start] and score_plot_macro[peak] > score_plot_macro[end])):
```

I believe `score_plot_micro[peak] > 0.95` should be `score_plot_macro[peak] > 0.95`.
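That is, the one-token change I'm proposing (assuming the macro branch is indeed meant to check the macro score plot):

```diff
-if end-start > macro_min and end-start < macro_max and (score_plot_micro[peak] > 0.95 or (score_plot_macro[peak] > score_plot_macro[start] and score_plot_macro[peak] > score_plot_macro[end])):
+if end-start > macro_min and end-start < macro_max and (score_plot_macro[peak] > 0.95 or (score_plot_macro[peak] > score_plot_macro[start] and score_plot_macro[peak] > score_plot_macro[end])):
```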
I'm trying to build on your work and use it as a baseline model, but I'm quite frustrated by these reproduction results. Any insight or help would be greatly appreciated.