harvard-acc / deeprecsys
http://vlsiarch.eecs.harvard.edu/research/recommendation/
License: MIT License
Is it possible to add a LICENSE file that states which license this code is released under? This could simplify contributions to the DeepRecSys project.
Dear authors,
After our discussion last week, I had a chance to run some experiments on larger models. However, the results do not align with my expectations.
The configs are as follows:
{
"arch_mlp_bot": "13-256-32",
"arch_mlp_top": "256-128-1",
"arch_embedding_size": "39043-17289-7420-20263-7120-403346-976-147-39979772-25641295-39664985-585935",
"arch_sparse_feature_size": 32,
"num_indices_per_lookup_fixed": true,
"num_indices_per_lookup": 35,
"arch_interaction_op": "cat",
"model_type": "dlrm",
"model_name": "rm1"
}
{
"arch_mlp_bot": "13-512-256-64",
"arch_mlp_top": "512-512-256-1",
"arch_embedding_size": "39884407-39043-17289-7420-20263-315-7120-1543-263-38532952-2953546-403346-108-2208-11938-155-408-976-147-39979772-25641295-39664985-585935-12972-108-236",
"arch_sparse_feature_size": 64,
"num_indices_per_lookup_fixed": true,
"num_indices_per_lookup": 35,
"arch_interaction_op": "cat",
"model_type": "dlrm",
"model_name": "rm1"
}
{
"arch_mlp_bot": "2560-1024-256-32",
"arch_mlp_top": "512-256-1",
"arch_embedding_size": "39884407-39043-17289-7420-20263-310-7120-1543-263-155-400-976-147-39979772-25641295-39664985-585935-12972-108-236",
"arch_sparse_feature_size": 32,
"num_indices_per_lookup_fixed": true,
"num_indices_per_lookup": 20,
"arch_interaction_op": "cat",
"model_type": "dlrm",
"model_name": "rm3"
}
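For context, those footprints come from summing each table's rows times the embedding dimension times 4 bytes (fp32). A quick script of my own (not part of DeepRecSys) reproduces the RMC-2 figure from the config above:

```python
# Estimate DLRM embedding-table footprint: rows * embedding_dim * 4 bytes
# (fp32), summed over all tables. This is my own sanity-check script, not
# code from the DeepRecSys repo.

def embedding_table_gb(arch_embedding_size: str, sparse_feature_size: int) -> float:
    """Total fp32 embedding storage in GB (10^9 bytes) for one config."""
    rows = sum(int(n) for n in arch_embedding_size.split("-"))
    return rows * sparse_feature_size * 4 / 1e9

# The RMC-2-like config above: 26 tables, 64-dim embeddings.
rmc2_tables = ("39884407-39043-17289-7420-20263-315-7120-1543-263-38532952-"
               "2953546-403346-108-2208-11938-155-408-976-147-39979772-"
               "25641295-39664985-585935-12972-108-236")
print(round(embedding_table_gb(rmc2_tables, 64), 1))  # -> 48.1
```

The same arithmetic applied to the first and third configs gives the 13.6 GB and roughly 18.6 GB figures.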
You can see the embedding tables are 13.6 GB, 48.1 GB, and 18.6 GB on RMC-1, RMC-2, and RMC-3, respectively. I expected RMC-2 to spend more time in SLS, since its embedding tables are very large compared to its FC layers, but the results say otherwise.
For your reference, here is an excerpt of the log with batch size 256 on RMC-2:
Time per operator type:
18.6277 ms. 70.1539%. FC
6.78981 ms. 25.5711%. SparseLengthsSum
0.76081 ms. 2.86529%. Concat
0.369324 ms. 1.39091%. Relu
0.004991 ms. 0.0187966%. Sigmoid
26.5526 ms in Total
FLOP per operator type:
0.733889 GFLOP. 98.0089%. FC
0.0149094 GFLOP. 1.99112%. SparseLengthsSum
0 GFLOP. 0%. Concat
0 GFLOP. 0%. Relu
0.748798 GFLOP in Total
Feature Memory Read per operator type:
60.5962 MB. 81.7255%. SparseLengthsSum
9.61767 MB. 12.9712%. FC
2.16269 MB. 2.91679%. Relu
1.76948 MB. 2.38647%. Concat
74.1461 MB in Total
Feature Memory Written per operator type:
2.16371 MB. 35.4947%. FC
2.16269 MB. 35.4779%. Relu
1.76947 MB. 29.0274%. Concat
0 MB. 0%. SparseLengthsSum
6.09587 MB in Total
Parameter Memory per operator type:
48068.8 MB. 99.9881%. SparseLengthsSum
5.73773 MB. 0.0119351%. FC
0 MB. 0%. Concat
0 MB. 0%. Relu
48074.5 MB in Total
I know that, compared to FC, SLS's compute intensity is very low, so the fact that it still takes 25% of the total time is already considerable. But this figure is very different from the one in your paper, so I am wondering what went wrong. Did I configure RMC-2 incorrectly?
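To make the low-compute-intensity point concrete, this is my own back-of-the-envelope arithmetic from the profile above (FLOPs divided by feature memory read; writes ignored):

```python
# Arithmetic intensity (FLOP per byte read) of FC vs. SparseLengthsSum,
# taken from the caffe2 profile numbers quoted above.
fc_flop, fc_read_mb = 0.733889e9, 9.61767
sls_flop, sls_read_mb = 0.0149094e9, 60.5962

fc_intensity = fc_flop / (fc_read_mb * 1e6)     # roughly 76 FLOP/byte
sls_intensity = sls_flop / (sls_read_mb * 1e6)  # roughly 0.25 FLOP/byte
print(fc_intensity, sls_intensity)
```

So SLS is two to three orders of magnitude more memory-bound than FC here, which is why even a 25% time share is notable.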
My CPU info is as below:
architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
Stepping: 4
CPU MHz: 899.926
CPU max MHz: 3000.0000
CPU min MHz: 800.0000
BogoMIPS: 4400.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 14080K
NUMA node0 CPU(s): 0-9,20-29
NUMA node1 CPU(s): 10-19,30-39
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin intel_pt ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear spec_ctrl intel_stibp flush_l1d
Thanks!
Lines 190 to 191 in f07ef3f
Should this be in a while loop?
From what I understand, each inference engine should signal the ReadyQueue only once, just as in accelInferenceEngine
for example:
DeepRecSys/accelInferenceEngine.py
Lines 33 to 34 in f07ef3f
Also, the queue is consumed exactly args.inference_engines times, and is not consumed anymore afterwards:
Lines 76 to 78 in f07ef3f
It seems the statement in the while loop will keep piling elements into the ReadyQueue for each request it handles, and if the queue reaches its maximum size (32767?) this could cause some sort of deadlock.
Please confirm whether I understood correctly.
Thanks.
Line 113 seems redundant: sparse_group.size == sparse_group_size is always (will eventually be) true, following the while loop on line 108.
DeepRecSys/data_generator/dlrm_data_caffe2.py
Line 113 in 1383176
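To illustrate the pattern I mean (this is my own paraphrase, not the actual DeepRecSys source): indices are resampled until the lookup group reaches the target size, so a size check after the loop cannot fail.

```python
import numpy as np

# Paraphrased pattern: resample until the lookup group has exactly
# sparse_group_size unique indices.
sparse_group_size = 8
num_rows = 100
rng = np.random.default_rng(0)

sparse_group = np.unique(rng.integers(0, num_rows, sparse_group_size))
while sparse_group.size < sparse_group_size:
    sparse_group = np.unique(rng.integers(0, num_rows, sparse_group_size))

# The while condition only exits once the sizes match, so a subsequent
# `if sparse_group.size == sparse_group_size:` guard is always true --
# which is the redundancy being pointed out.
assert sparse_group.size == sparse_group_size
```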
Hello,
I am trying to use DeepRecSys on a machine running Ubuntu 18.04, but I run into problems when installing the dependencies (pip3 install -r build/pip_requirements.txt), since some specific package versions cannot be found. I also have trouble installing caffe2.
Does DeepRecSys work on Ubuntu 18.04, or should I try another version?
Thanks,
Ivan
Hi,
As far as I know, it does not support inference on a real GPU. The open-source code does not perform actual GPU inference; instead, it mimics GPU inference by waiting a fixed amount of time derived from an analytic model. Is there a reason not to use an actual GPU? Since DLRM supports GPU inference, I am wondering why this version excludes the GPU-side code.
Thanks,
Jeongseob
Hi,
Thanks a lot for the useful infrastructure and high-quality code. If we want to generate synthetic data using the data generators, do we need a workload file, or can the "sd_cumm" and "sd_prob" files be used as the input to "trace_profile.py"? Or are "sd_cumm" and "sd_prob" the results of a trace profiling you have already done?
I found in the config files that the embedding table sizes of RMC-1, RMC-2, and RMC-3 are 4 GB, 4 GB, and 2 GB, which are relatively small, given that embedding tables are usually said to be on the order of tens of GBs. (Actually, I am also a bit confused, because in another paper of yours the sizes are given as 100 MB, 10 GB, and 1 GB, still far from large.) This makes me wonder whether results on these small models carry over to the large models used in industry.
I know you have tested this in Facebook's production environment and things worked well, but since we do not have access to those real systems, I would appreciate it if you could share any insights on this. Thanks!
After generating a synthetic trace with data_generator/trace_generator.py, I am trying to run dlrm_s_caffe2 in synthetic mode (using the command below). It keeps complaining that "'Namespace' object has no attribute 'data_size'". If I try to pass a value with --data_size, it does not recognize the argument.
python dlrm_s_caffe2.py --inference_only --caffe2_net_type async_dag --config_file "configs/dlrm_rm2.json" --engine "prof_dag" --nepochs 100 --data_generation "synthetic" --round_targets 10 --num_batches 512 --mini_batch_size 16 --max_mini_batch_size 16 --data_trace_file "../data_generator/syn_traces/tbl1"
Am I doing something wrong? Can you provide a sample command line to run dlrm with synthetic inputs?
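In case it helps anyone hitting the same error: a possible local workaround is to register the missing argument myself, assuming the script's arguments are parsed with argparse and the synthetic path just needs the attribute to exist (the flag name is taken from the error message; I have not confirmed the right default against the repo):

```python
import argparse

parser = argparse.ArgumentParser()
# ... the script's existing arguments would be registered here ...
# Hypothetical fix: register the attribute the synthetic data path expects,
# so `args.data_size` exists even when the flag is not passed on the CLI.
parser.add_argument("--data_size", type=int, default=1000000)

args = parser.parse_args(["--data_size", "4096"])
print(args.data_size)  # -> 4096
```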