harvard-acc / deeprecsys
http://vlsiarch.eecs.harvard.edu/research/recommendation/
License: MIT License
Is it possible to add a LICENSE file that states which license this code is released under? This could simplify contributions to the DeepRecSys project.
Dear authors,
After our discussion last week, I had a chance to run some experiments on larger models. However, the results do not align with my expectations.
The configs are as follows:
{
"arch_mlp_bot": "13-256-32",
"arch_mlp_top": "256-128-1",
"arch_embedding_size": "39043-17289-7420-20263-7120-403346-976-147-39979772-25641295-39664985-585935",
"arch_sparse_feature_size": 32,
"num_indices_per_lookup_fixed": true,
"num_indices_per_lookup": 35,
"arch_interaction_op": "cat",
"model_type": "dlrm",
"model_name": "rm1"
}
{
"arch_mlp_bot": "13-512-256-64",
"arch_mlp_top": "512-512-256-1",
"arch_embedding_size": "39884407-39043-17289-7420-20263-315-7120-1543-263-38532952-2953546-403346-108-2208-11938-155-408-976-147-39979772-25641295-39664985-585935-12972-108-236",
"arch_sparse_feature_size": 64,
"num_indices_per_lookup_fixed": true,
"num_indices_per_lookup": 35,
"arch_interaction_op": "cat",
"model_type": "dlrm",
"model_name": "rm1"
}
{
"arch_mlp_bot": "2560-1024-256-32",
"arch_mlp_top": "512-256-1",
"arch_embedding_size": "39884407-39043-17289-7420-20263-310-7120-1543-263-155-400-976-147-39979772-25641295-39664985-585935-12972-108-236",
"arch_sparse_feature_size": 32,
"num_indices_per_lookup_fixed": true,
"num_indices_per_lookup": 20,
"arch_interaction_op": "cat",
"model_type": "dlrm",
"model_name": "rm3"
}
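For context, those footprints come from summing each table's rows times the embedding dimension times 4 bytes (fp32). A quick script of my own (not part of DeepRecSys) reproduces the RMC-2 figure from the config above:

```python
# Estimate DLRM embedding-table footprint: rows * embedding_dim * 4 bytes
# (fp32), summed over all tables. This is my own sanity-check script, not
# code from the DeepRecSys repo.

def embedding_table_gb(arch_embedding_size: str, sparse_feature_size: int) -> float:
    """Total fp32 embedding storage in GB (10^9 bytes) for one config."""
    rows = sum(int(n) for n in arch_embedding_size.split("-"))
    return rows * sparse_feature_size * 4 / 1e9

# The RMC-2-like config above: 26 tables, 64-dim embeddings.
rmc2_tables = ("39884407-39043-17289-7420-20263-315-7120-1543-263-38532952-"
               "2953546-403346-108-2208-11938-155-408-976-147-39979772-"
               "25641295-39664985-585935-12972-108-236")
print(round(embedding_table_gb(rmc2_tables, 64), 1))  # -> 48.1
```

The same arithmetic applied to the first and third configs gives the 13.6 GB and roughly 18.6 GB figures.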
You can see the embedding tables are 13.6 GB, 48.1 GB, and 18.6 GB on RMC-1, RMC-2, and RMC-3, respectively. I expected RMC-2 to spend more time in SLS, since its embedding tables are very large compared to its FC layers, but the results say otherwise.
For your reference, here is an excerpt of the log with batch size 256 on RMC-2:
Time per operator type:
18.6277 ms. 70.1539%. FC
6.78981 ms. 25.5711%. SparseLengthsSum
0.76081 ms. 2.86529%. Concat
0.369324 ms. 1.39091%. Relu
0.004991 ms. 0.0187966%. Sigmoid
26.5526 ms in Total
FLOP per operator type:
0.733889 GFLOP. 98.0089%. FC
0.0149094 GFLOP. 1.99112%. SparseLengthsSum
0 GFLOP. 0%. Concat
0 GFLOP. 0%. Relu
0.748798 GFLOP in Total
Feature Memory Read per operator type:
60.5962 MB. 81.7255%. SparseLengthsSum
9.61767 MB. 12.9712%. FC
2.16269 MB. 2.91679%. Relu
1.76948 MB. 2.38647%. Concat
74.1461 MB in Total
Feature Memory Written per operator type:
2.16371 MB. 35.4947%. FC
2.16269 MB. 35.4779%. Relu
1.76947 MB. 29.0274%. Concat
0 MB. 0%. SparseLengthsSum
6.09587 MB in Total
Parameter Memory per operator type:
48068.8 MB. 99.9881%. SparseLengthsSum
5.73773 MB. 0.0119351%. FC
0 MB. 0%. Concat
0 MB. 0%. Relu
48074.5 MB in Total
I know that, compared to FC, SLS's compute intensity is very low, so the fact that it still takes 25% of the total time is already considerable. But this figure is very different from the one in your paper, so I am wondering what went wrong. Did I configure RMC-2 incorrectly?
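To make the low-compute-intensity point concrete, this is my own back-of-the-envelope arithmetic from the profile above (FLOPs divided by feature memory read; writes ignored):

```python
# Arithmetic intensity (FLOP per byte read) of FC vs. SparseLengthsSum,
# taken from the caffe2 profile numbers quoted above.
fc_flop, fc_read_mb = 0.733889e9, 9.61767
sls_flop, sls_read_mb = 0.0149094e9, 60.5962

fc_intensity = fc_flop / (fc_read_mb * 1e6)     # roughly 76 FLOP/byte
sls_intensity = sls_flop / (sls_read_mb * 1e6)  # roughly 0.25 FLOP/byte
print(fc_intensity, sls_intensity)
```

So SLS is two to three orders of magnitude more memory-bound than FC here, which is why even a 25% time share is notable.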
My CPU info is as below:
architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
Stepping: 4
CPU MHz: 899.926
CPU max MHz: 3000.0000
CPU min MHz: 800.0000
BogoMIPS: 4400.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 14080K
NUMA node0 CPU(s): 0-9,20-29
NUMA node1 CPU(s): 10-19,30-39
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin intel_pt ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear spec_ctrl intel_stibp flush_l1d
Thanks!
Lines 190 to 191 in f07ef3f
Should this be in a while loop?
From what I understand, each inference engine should signal the ReadyQueue only once, just as in accelInferenceEngine
for example:
DeepRecSys/accelInferenceEngine.py
Lines 33 to 34 in f07ef3f
Also, the queue is consumed exactly args.inference_engines times, and is not consumed anymore afterwards:
Lines 76 to 78 in f07ef3f
It seems the statement in the while loop will keep piling elements into the ReadyQueue for each request it handles, and if the queue reaches its maximum size (32767?) this could cause some sort of deadlock.
Please confirm whether I understood correctly.
Thanks.
Line 113 seems redundant: sparse_group.size == sparse_group_size is always (will eventually be) true, following the while loop on line 108.
DeepRecSys/data_generator/dlrm_data_caffe2.py
Line 113 in 1383176
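To illustrate the pattern I mean (this is my own paraphrase, not the actual DeepRecSys source): indices are resampled until the lookup group reaches the target size, so a size check after the loop cannot fail.

```python
import numpy as np

# Paraphrased pattern: resample until the lookup group has exactly
# sparse_group_size unique indices.
sparse_group_size = 8
num_rows = 100
rng = np.random.default_rng(0)

sparse_group = np.unique(rng.integers(0, num_rows, sparse_group_size))
while sparse_group.size < sparse_group_size:
    sparse_group = np.unique(rng.integers(0, num_rows, sparse_group_size))

# The while condition only exits once the sizes match, so a subsequent
# `if sparse_group.size == sparse_group_size:` guard is always true --
# which is the redundancy being pointed out.
assert sparse_group.size == sparse_group_size
```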
Hello,
I am trying to use DeepRecSys on a machine running Ubuntu 18.04, but I run into problems when installing the dependencies (pip3 install -r build/pip_requirements.txt), since some specific package versions cannot be found. I also have trouble installing caffe2.
Does DeepRecSys work on Ubuntu 18.04, or should I try another version?
Thanks,
Ivan
Hi,
As far as I know, it does not support inference on a real GPU. The open-source code does not perform actual GPU inference; instead, it mimics GPU inference by waiting a fixed amount of time derived from an analytic model. Is there a reason not to use an actual GPU? Since DLRM supports GPU inference, I am wondering why this version excludes the GPU-side code.
Thanks,
Jeongseob
Hi,
Thanks a lot for the useful infrastructure and high-quality code. If we want to generate synthetic data using the data generators, do we need a workload file, or can the "sd_cumm" and "sd_prob" files be used as the input to "trace_profile.py"? Or are "sd_cumm" and "sd_prob" the results of a trace profiling you have already done?
I found in the config files that the embedding table sizes of RMC-1, RMC-2, and RMC-3 are 4 GB, 4 GB, and 2 GB, which are relatively small, given that embedding tables are usually said to be on the order of tens of GBs. (Actually, I am also a bit confused, because in another paper of yours the sizes are given as 100 MB, 10 GB, and 1 GB, still far from large.) This makes me wonder whether results on these small models carry over to the large models used in industry.
I know you have tested this in Facebook's production environment and things worked well, but since we do not have access to those real systems, I would appreciate it if you could share any insights on this. Thanks!
After generating a synthetic trace with data_generator/trace_generator.py, I am trying to run dlrm_s_caffe2 in synthetic mode (using the command below). It keeps complaining that "'Namespace' object has no attribute 'data_size'". If I try to pass a value with --data_size, it does not recognize the argument.
python dlrm_s_caffe2.py --inference_only --caffe2_net_type async_dag --config_file "configs/dlrm_rm2.json" --engine "prof_dag" --nepochs 100 --data_generation "synthetic" --round_targets 10 --num_batches 512 --mini_batch_size 16 --max_mini_batch_size 16 --data_trace_file "../data_generator/syn_traces/tbl1"
Am I doing something wrong? Can you provide a sample command line to run dlrm with synthetic inputs?
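In case it helps anyone hitting the same error: a possible local workaround is to register the missing argument myself, assuming the script's arguments are parsed with argparse and the synthetic path just needs the attribute to exist (the flag name is taken from the error message; I have not confirmed the right default against the repo):

```python
import argparse

parser = argparse.ArgumentParser()
# ... the script's existing arguments would be registered here ...
# Hypothetical fix: register the attribute the synthetic data path expects,
# so `args.data_size` exists even when the flag is not passed on the CLI.
parser.add_argument("--data_size", type=int, default=1000000)

args = parser.parse_args(["--data_size", "4096"])
print(args.data_size)  # -> 4096
```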