Comments (10)
@sgao3 Yeah exactly, for me too. Closing this issue.
The accuracy is initially high because the dataset is heavily skewed toward non-clicks (~3% clicks vs. ~97% non-clicks), so the model can always guess non-click and still attain high accuracy.
If you are not sub-sampling the non-clicks, I would suggest using the --mlperf-logging flag, which lets you track several metrics, including AUC, a metric better suited to this situation.
P.S. You should also add --test-freq=10240 to see the test metrics (including AUC) at fixed intervals.
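To see why accuracy is so uninformative at this skew, here is a minimal sketch (not from the dlrm repo; it assumes numpy and scikit-learn are available): a model that always predicts non-click scores ~97% accuracy yet has no ranking power at all (AUC = 0.5).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = (rng.random(100_000) < 0.03).astype(int)  # ~3% clicks, ~97% non-clicks
y_score = np.zeros(y_true.size)                    # always predict "non-click"

accuracy = np.mean((y_score > 0.5) == y_true)
print(f"accuracy: {accuracy:.3f}")                   # ~0.970 from a useless model
print(f"auc: {roc_auc_score(y_true, y_score):.3f}")  # 0.500, i.e. random ranking
```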
Thanks, @mnaumovfb, for the explanation. Nevertheless, I'm trying to reproduce these plots, which were generated using the following suggested command line (plus the flag for sub-sampling), according to the instructions:
--arch-sparse-feature-size=64 --arch-mlp-bot="13-512-256-64" --arch-mlp-top="512-512-256-1" --max-ind-range=10000000 --data-generation=dataset --data-set=terabyte --raw-data-file=./data/day --processed-data-file=./input/terabyte_processed.npz --loss-function=bce --round-targets=True --learning-rate=0.1 --mini-batch-size=2048 --print-freq=1024 --print-time --test-mini-batch-size=16384 --test-num-workers=16 --use-gpu --test-freq=10240 --memory-map --data-sub-sample-rate=0.875
However, the training accuracy in the plot starts at ~79% and the training loss at 0.48, while my run still starts with the following:
Finished training it 1024/2048437 of epoch 0, 63.01 ms/it, loss 0.138724, accuracy 96.726 %
Can you please suggest how I can reproduce the exact results shown in the plots?
Thanks again.
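(Side note for anyone comparing their own runs against the README plots: the curves can be re-created by parsing the training log. Below is a hypothetical helper, assuming only the log format visible in this thread.)

```python
import re

# Matches lines like:
# Finished training it 1024/2048437 of epoch 0, 63.01 ms/it, loss 0.138724, accuracy 96.726 %
LINE_RE = re.compile(
    r"Finished training it (\d+)/\d+ of epoch \d+"
    r".*?loss ([\d.]+), accuracy ([\d.]+) %"
)

def parse_log(path):
    iters, losses, accs = [], [], []
    with open(path) as f:
        for line in f:
            m = LINE_RE.search(line)
            if m:
                iters.append(int(m.group(1)))
                losses.append(float(m.group(2)))
                accs.append(float(m.group(3)))
    return iters, losses, accs

# e.g.: it, loss, acc = parse_log("train.log"); then plot acc against it
```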
Your original command above does not include the --data-sub-sample-rate=0.875 flag. I suspect that is the reason for the high accuracy (i.e. you are running on the full dataset). If the flag is added, I would expect the accuracy to drop and match the one reported in the README.
@mnaumovfb, but I am using this flag; I added it after your first comment. The command line I use is:
--arch-sparse-feature-size=64 --arch-mlp-bot="13-512-256-64" --arch-mlp-top="512-512-256-1" --max-ind-range=10000000 --data-generation=dataset --data-set=terabyte --raw-data-file=./data/day --processed-data-file=./input/terabyte_processed.npz --loss-function=bce --round-targets=True --learning-rate=0.1 --mini-batch-size=2048 --print-freq=1024 --print-time --test-mini-batch-size=16384 --test-num-workers=16 --use-gpu --test-freq=10240 --memory-map --data-sub-sample-rate=0.875
But the accuracy is still very high. This is after 1024 steps:
Finished training it 1024/2048437 of epoch 0, 63.01 ms/it, loss 0.138724, accuracy 96.726 %
This flag affects the pre-processing of the dataset itself. If you did not have it during pre-processing and only added it to the command line for training, it will have no effect. Is that what you are doing, or did you include it from the beginning?
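For intuition, here is a minimal sketch of what the sub-sampling does at pre-processing time. This is my reading of the repo's pre-processing (treat the exact semantics as an assumption): every click is kept, and each non-click survives with probability 1 - rate.

```python
import numpy as np

def sub_sample_mask(targets, rate=0.875, seed=0):
    """Keep all clicks; keep each non-click with probability (1 - rate)."""
    rng = np.random.default_rng(seed)
    return (targets == 1) | (rng.random(targets.size) >= rate)

# With ~3% clicks in the raw data, the surviving set is about
#   0.03 / (0.03 + 0.97 * 0.125) ~= 19.8% clicks,
# so always guessing "non-click" scores only ~80.2% accuracy afterwards:
pos, neg = 0.03, 0.97 * (1 - 0.875)
print(f"majority-class baseline after sub-sampling: {neg / (pos + neg):.3f}")  # ~0.802
```

That ~80% baseline is why the README plots start around 79-80%, while runs on the full (non-sub-sampled) data start near 97%.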
I added it after the pre-processing.
So suppose I don't want to redo the pre-processing from scratch: what convergence pattern should I expect? Very high accuracy at the beginning, and then a drop?
If you simply run the full dataset with the above options, the accuracy metric will just oscillate around 97%. Therefore, when running with the full dataset it is more meaningful to look at the AUC metric. You can obtain it by adding the --mlperf-logging and --test-freq flags, as I mentioned earlier.
There is a caveat, though. By default the --mlperf-logging flag uses a different data loader, so to switch back to the default loader, which you have already used for pre-processing, you have to change the if statement on line 381 of dlrm_data_pytorch.py to "if False:". I think this might work, but I do not guarantee it.
Thank you, I will try that.
Just FYI, I had the same issue when I found this thread. I tried the suggested mlperf option, but I still saw the 96.x% accuracy at the beginning of training.
Since the sub-sampled data is roughly 1 - 0.875 of the original size, I regenerated the dataset and trained the model (2~3 days, 1 GPU); the resulting accuracy looks similar to the plot included in the repo. The commit I am using is 4705ea1.
The following are the first 20 training print-outs and 2 test outputs I got.
data file: ./input/terabyte_processed_train.bin number of batches: 315310
data file: ./input/terabyte_processed_test.bin number of batches: 840
time/loss/accuracy (if enabled):
Finished training it 1024/315310 of epoch 0, 15.12 ms/it, loss 0.476438, accuracy 78.931 %
Finished training it 2048/315310 of epoch 0, 13.42 ms/it, loss 0.466592, accuracy 79.291 %
Finished training it 3072/315310 of epoch 0, 12.92 ms/it, loss 0.460231, accuracy 79.614 %
Finished training it 4096/315310 of epoch 0, 12.99 ms/it, loss 0.453668, accuracy 79.920 %
Finished training it 5120/315310 of epoch 0, 13.00 ms/it, loss 0.448201, accuracy 80.160 %
Finished training it 6144/315310 of epoch 0, 12.89 ms/it, loss 0.444909, accuracy 80.307 %
Finished training it 7168/315310 of epoch 0, 12.84 ms/it, loss 0.443121, accuracy 80.368 %
Finished training it 8192/315310 of epoch 0, 12.80 ms/it, loss 0.441182, accuracy 80.438 %
Finished training it 9216/315310 of epoch 0, 12.86 ms/it, loss 0.439022, accuracy 80.522 %
Finished training it 10240/315310 of epoch 0, 12.80 ms/it, loss 0.437467, accuracy 80.612 %
Testing at - 10240/315310 of epoch 0, loss 0.441341, recall 0.2560, precision 0.6271, f1 0.3635, ap 0.5101, auc 0.7713, best auc 0.7713, accuracy 80.243 %, best accuracy 80.243 %
Finished training it 11264/315310 of epoch 0, 14.96 ms/it, loss 0.436469, accuracy 80.659 %
Finished training it 12288/315310 of epoch 0, 13.02 ms/it, loss 0.436003, accuracy 80.652 %
Finished training it 13312/315310 of epoch 0, 13.19 ms/it, loss 0.434785, accuracy 80.723 %
Finished training it 14336/315310 of epoch 0, 12.94 ms/it, loss 0.434305, accuracy 80.735 %
Finished training it 15360/315310 of epoch 0, 12.92 ms/it, loss 0.433218, accuracy 80.794 %
Finished training it 16384/315310 of epoch 0, 12.84 ms/it, loss 0.433219, accuracy 80.782 %
Finished training it 17408/315310 of epoch 0, 12.84 ms/it, loss 0.432135, accuracy 80.817 %
Finished training it 18432/315310 of epoch 0, 12.87 ms/it, loss 0.432788, accuracy 80.769 %
Finished training it 19456/315310 of epoch 0, 12.86 ms/it, loss 0.431306, accuracy 80.873 %
Finished training it 20480/315310 of epoch 0, 12.88 ms/it, loss 0.431068, accuracy 80.876 %
Testing at - 20480/315310 of epoch 0, loss 0.437152, recall 0.3036, precision 0.6141, f1 0.4063, ap 0.5224, auc 0.7792, best auc 0.7792, accuracy 80.442 %, best accuracy 80.442 %
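(As a quick sanity check on these test lines: the logged f1 is the harmonic mean of the printed precision and recall, so the numbers are internally consistent.)

```python
p, r = 0.6271, 0.2560       # precision and recall from the first test line above
print(2 * p * r / (p + r))  # ~0.3636, consistent with the logged f1 of 0.3635
                            # (small difference comes from p and r being rounded)
```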