
deeprecsys's People

Contributors

alugupta, devanshudesai, huwan, sungmincho


deeprecsys's Issues

License information

Is it possible to add a LICENSE file that explains which license this code is released under? This could simplify contributions to the DeepRecSys project.

operator breakdown on larger models

Dear authors,

After our discussion last week, I had a chance to run some experiments on larger models. However, the results do not align with my expectations.

The configs are as follows:

{
    "arch_mlp_bot": "13-256-32",
    "arch_mlp_top": "256-128-1",
    "arch_embedding_size": "39043-17289-7420-20263-7120-403346-976-147-39979772-25641295-39664985-585935", 
    "arch_sparse_feature_size": 32,
    "num_indices_per_lookup_fixed": true,
    "num_indices_per_lookup": 35,
    "arch_interaction_op": "cat",
    "model_type": "dlrm",
    "model_name": "rm1"
}

{
    "arch_mlp_bot": "13-512-256-64",
    "arch_mlp_top": "512-512-256-1",
    "arch_embedding_size": "39884407-39043-17289-7420-20263-315-7120-1543-263-38532952-2953546-403346-108-2208-11938-155-408-976-147-39979772-25641295-39664985-585935-12972-108-236",
    "arch_sparse_feature_size": 64,
    "num_indices_per_lookup_fixed": true,
    "num_indices_per_lookup": 35,
    "arch_interaction_op": "cat",
    "model_type": "dlrm",
    "model_name": "rm1"
}

{
    "arch_mlp_bot": "2560-1024-256-32",
    "arch_mlp_top": "512-256-1",
    "arch_embedding_size": "39884407-39043-17289-7420-20263-310-7120-1543-263-155-400-976-147-39979772-25641295-39664985-58 5935-12972-108-236",
    "arch_sparse_feature_size": 32,
    "num_indices_per_lookup_fixed": true,
    "num_indices_per_lookup": 20,
    "arch_interaction_op": "cat",
    "model_type": "dlrm",
    "model_name": "rm3"
}

You can see the embedding tables total 13.6 GB, 48.1 GB, and 18.6 GB for RMC-1, RMC-2, and RMC-3 respectively. I expected RMC-2 to spend the most time in SLS, since its embedding tables are very large relative to its FC layers, but the results show otherwise.
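
For reference, the table sizes follow directly from the configs. A minimal sketch of the arithmetic, assuming fp32 (4-byte) embedding weights:

def embedding_table_gb(arch_embedding_size, sparse_feature_size, bytes_per_elem=4):
    # Total rows across all tables, times embedding dim, times element size.
    rows = sum(int(n) for n in arch_embedding_size.split("-"))
    return rows * sparse_feature_size * bytes_per_elem / 1e9

# RMC-2 config above: ~187.8M rows x 64 dims x 4 B ~= 48.1 GB, consistent with
# the 48068.8 MB SparseLengthsSum "Parameter Memory" in the log further down.
rmc2_tables = ("39884407-39043-17289-7420-20263-315-7120-1543-263-38532952-"
               "2953546-403346-108-2208-11938-155-408-976-147-39979772-"
               "25641295-39664985-585935-12972-108-236")
print("%.1f GB" % embedding_table_gb(rmc2_tables, 64))  # -> 48.1 GB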

The per-operator time breakdown is shown in the figure attached to this issue (image not reproduced here).

For your reference, here is a log excerpt for batch size 256 on RMC-2:

Time per operator type:
        18.6277 ms.    70.1539%. FC
        6.78981 ms.    25.5711%. SparseLengthsSum
        0.76081 ms.    2.86529%. Concat
       0.369324 ms.    1.39091%. Relu
       0.004991 ms.  0.0187966%. Sigmoid
        26.5526 ms in Total
FLOP per operator type:
       0.733889 GFLOP.    98.0089%. FC
      0.0149094 GFLOP.    1.99112%. SparseLengthsSum
              0 GFLOP.          0%. Concat
              0 GFLOP.          0%. Relu
       0.748798 GFLOP in Total
Feature Memory Read per operator type:
        60.5962 MB.    81.7255%. SparseLengthsSum
        9.61767 MB.    12.9712%. FC
        2.16269 MB.    2.91679%. Relu
        1.76948 MB.    2.38647%. Concat
        74.1461 MB in Total
Feature Memory Written per operator type:
        2.16371 MB.    35.4947%. FC
        2.16269 MB.    35.4779%. Relu
        1.76947 MB.    29.0274%. Concat
              0 MB.          0%. SparseLengthsSum
        6.09587 MB in Total
Parameter Memory per operator type:
        48068.8 MB.    99.9881%. SparseLengthsSum
        5.73773 MB.  0.0119351%. FC
              0 MB.          0%. Concat
              0 MB.          0%. Relu
        48074.5 MB in Total

I know that SLS has very low compute intensity compared to FC, so even 25% of the total time is a considerable share for it. But this figure is very different from the one in your paper, so I am wondering what went wrong. Did I configure RMC-2 incorrectly?
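
To put a number on that intensity gap, dividing the logged FLOPs by the logged feature-memory reads gives roughly:

# Arithmetic intensity (FLOP per byte read), using the RMC-2 log above.
fc_intensity = 0.733889e9 / 9.61767e6    # ~76 FLOP/byte
sls_intensity = 0.0149094e9 / 60.5962e6  # ~0.25 FLOP/byte
print(fc_intensity / sls_intensity)      # FC is ~300x more compute-intense

So SLS is firmly memory-bound, and its share of runtime is set by memory bandwidth rather than by its negligible FLOP count.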

My CPU info is as below:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                40
On-line CPU(s) list:   0-39
Thread(s) per core:    2
Core(s) per socket:    10
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
Stepping:              4
CPU MHz:               899.926
CPU max MHz:           3000.0000
CPU min MHz:           800.0000
BogoMIPS:              4400.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              14080K
NUMA node0 CPU(s):     0-9,20-29
NUMA node1 CPU(s):     10-19,30-39
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin intel_pt ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear spec_ctrl intel_stibp flush_l1d

Thanks!

`inferenceEngineReadyQueue.put(True)` in a while loop?

while True:
    inferenceEngineReadyQueue.put(True)

Should this be in a while loop?

From what I understand, each inference engine should signal the ReadyQueue only once, just as accelInferenceEngine does, for example:

else:
    inferenceEngineReadyQueue.put(True)

Also, the queue is consumed exactly args.inference_engines times and is never read again afterwards:

while ready_engines < args.inference_engines:
    inferenceEngineReadyQueue.get()
    ready_engines += 1

It seems like the put inside the while loop will keep piling elements into the ReadyQueue for every request the engine handles, and if the queue ever reaches its maximum size (32767?), that could cause some sort of deadlock.
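
For illustration, the handshake I would expect looks like the sketch below. This is my paraphrase using the names from the snippets above; requestQueue and run_inference are placeholders, not the actual DeepRecSys code:

# Engine side: signal readiness exactly once, then enter the serving loop.
inferenceEngineReadyQueue.put(True)
while True:
    request = requestQueue.get()
    if request is None:        # poison pill: shut the engine down
        break
    run_inference(request)     # placeholder for the actual model call

# Scheduler side: consume exactly args.inference_engines ready signals.
ready_engines = 0
while ready_engines < args.inference_engines:
    inferenceEngineReadyQueue.get()
    ready_engines += 1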

Please confirm whether I understood correctly.

Thanks.

Requirements

Hello,

I am trying to use DeepRecSys on a machine running Ubuntu 18.04, but I run into problems when installing the dependencies (pip3 install -r build/pip_requirements.txt), since some of the pinned package versions cannot be found. I also have problems installing Caffe2.

Does DeepRecSys work on an Ubuntu 18.04 machine, or should I try another version?

Thanks,
Ivan

supporting the inference on a real GPU

Hi,

As far as I know, DeepRecSys does not support inference on a real GPU. The code in this open-source release does not perform actual GPU inference; instead, it mimics GPU inference by waiting a fixed amount of time derived from an analytical model. Is there a reason not to use an actual GPU? Since DLRM supports GPU inference, I am wondering why this version excludes the GPU-side code.
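
For other readers, my understanding of that stand-in is roughly the sketch below. This is my paraphrase; the function name and coefficients are made up, not taken from the repository:

import time

def mimic_gpu_inference(batch_size, launch_overhead_s=2e-3, per_sample_s=1e-5):
    # Instead of running the model on a GPU, sleep for the latency that a
    # simple analytic model predicts: a fixed launch/transfer overhead plus
    # a per-sample term. The coefficients here are illustrative only.
    time.sleep(launch_overhead_s + per_sample_s * batch_size)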

Thanks,
Jeongseob

Synthetic data generator

Hi,

Thanks a lot for the useful infrastructure and high-quality code. If we want to generate synthetic data using the data generators, do we need a workload file, or can the "sd_cumm" and "sd_prob" files be used as the input to "trace_profile.py"? Or are "sd_cumm" and "sd_prob" the results of a trace profiling that you have already done?
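
In the meantime, my working assumption is that these files encode the profiled access distribution, so synthetic lookups could be drawn from them along these lines (the file format and usage are guesses on my part):

import numpy as np

# Assumed format: one cumulative-probability value per embedding row,
# nondecreasing in [0, 1], saved as plain text.
cdf = np.loadtxt("sd_cumm")

def sample_indices(num_lookups, seed=None):
    rng = np.random.default_rng(seed)
    # Inverse-CDF sampling: draw uniforms and map them through the CDF
    # to reproduce the profiled lookup popularity.
    return np.searchsorted(cdf, rng.random(num_lookups))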

Can these implemented models represent real scenarios?

I found in the config files that the embedding table sizes of RMC-1, RMC-2, and RMC-3 are 4 GB, 4 GB, and 2 GB, which are relatively small, given that we are told production embedding tables are usually on the order of tens of GBs. (Actually, I am also a bit confused, because in another paper of yours the sizes are given as 100 MB, 10 GB, and 1 GB, still far from large.) This makes me wonder whether results on these small models carry over to the large models used in industry.

I know you have tested this in Facebook's production environment and things worked well, but since we do not have access to those systems, I would appreciate it if you could share any insights on this. Thanks!

Running DLRM with synthetic inputs

After generating a synthetic trace with data_generator/trace_generator.py, I am trying to run dlrm_s_caffe2 in synthetic mode (using the command below). It keeps complaining that "'Namespace' object has no attribute 'data_size'". If I try to pass a value via --data_size, it doesn't recognize the argument.

python dlrm_s_caffe2.py --inference_only --caffe2_net_type async_dag --config_file "configs/dlrm_rm2.json" --engine "prof_dag" --nepochs 100 --data_generation "synthetic" --round_targets 10 --num_batches 512 --mini_batch_size 16 --max_mini_batch_size 16 --data_trace_file "../data_generator/syn_traces/tbl1"

Am I doing something wrong? Can you provide a sample command line to run dlrm with synthetic inputs?
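
In case it helps, the workaround I would try is giving the parser the missing attribute. This is a hypothetical patch against dlrm_s_caffe2.py, and the default below is a guess:

# Hypothetical patch in dlrm_s_caffe2.py, next to the other add_argument calls:
parser.add_argument("--data_size", type=int, default=2048)

# Or, after parse_args(), derive it from flags that do exist:
if not hasattr(args, "data_size"):
    args.data_size = args.num_batches * args.mini_batch_size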
