Comments (6)

hongzimao commented on July 20, 2024

Thanks for your interest! Our evaluation used an earlier, private version of Alibaba's public trace (see footnote 3 in the paper). In principle, you can also test on the current version. You just need to load the job DAGs from the trace by specifying the task durations of each stage (https://github.com/hongzimao/decima-sim/blob/master/spark_env/job_generator.py#L37-L45); the real-world trace might not include these directly, but we measure the average task runtime under different degrees of parallelism (see Section 6.2, point 3 in the paper). Then follow the trace for the inter-job arrival times (TPC-H uses Poisson arrivals: https://github.com/hongzimao/decima-sim/blob/master/spark_env/job_generator.py#L129). Hope this helps!
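
In case it helps, here is a minimal sketch of how one might extract per-stage task durations from an Alibaba-style trace before feeding them to the job generator. The column names (job_id, stage_id, start_time, end_time) and the helper functions are placeholders, not the actual decima-sim API:

```python
# Hedged sketch: derive per-stage task durations from a trace CSV.
# Column names below are assumptions; adapt them to the real trace schema
# and to the loader in spark_env/job_generator.py.
import csv
from collections import defaultdict

import numpy as np


def load_stage_task_durations(trace_csv):
    """Group per-task durations (seconds) by (job_id, stage_id)."""
    durations = defaultdict(list)
    with open(trace_csv) as f:
        for row in csv.DictReader(f):
            key = (row["job_id"], row["stage_id"])
            durations[key].append(float(row["end_time"]) - float(row["start_time"]))
    return durations


def average_task_runtime(durations, job_id, stage_id):
    """Average task runtime for one stage; the paper measures this per degree
    of parallelism (Section 6.2, point 3), here it is just a plain mean."""
    return float(np.mean(durations[(job_id, stage_id)]))
```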

gaocegege commented on July 20, 2024

@hongzimao

I am trying to use the Decima framework to train a scheduler for distributed DL training jobs on Kubernetes. Thanks for your reply; it works for me.

One last question: when you evaluated the Alibaba private trace, did you use a Poisson distribution for the inter-arrival times or the real distribution from the trace? I intend to use the real distribution but am not sure whether that is possible, so I would really appreciate any advice.

Thanks.

hongzimao commented on July 20, 2024

The trace should provide the arrival time of each job. In our evaluation we replayed the trace, replacing the Poisson inter-arrival times with the actual arrival times.
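
For anyone adapting this, here is a minimal sketch of the two arrival models, assuming a hypothetical list of absolute arrival timestamps taken from the trace (this is not the decima-sim code itself):

```python
# Hedged sketch: synthetic Poisson arrivals vs. replaying trace arrivals.
import numpy as np


def poisson_interarrivals(rate_per_sec, n_jobs, seed=0):
    """Synthetic exponential inter-arrival gaps, as with the TPC-H workload."""
    rng = np.random.RandomState(seed)
    return rng.exponential(1.0 / rate_per_sec, size=n_jobs)


def replayed_interarrivals(trace_arrival_times):
    """Inter-arrival gaps taken directly from the trace's arrival timestamps."""
    return np.diff(np.sort(np.asarray(trace_arrival_times, dtype=float)))
```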

gaocegege commented on July 20, 2024

Gotcha, thanks a lot.

hongzimao commented on July 20, 2024

Optimizing distributed DL training on Kubernetes sounds very interesting! Please keep us updated on your progress. One of my colleagues recently applied a similar graph neural network technique to the TensorFlow device placement problem; the paper was just accepted to NIPS, if you are interested: https://arxiv.org/pdf/1906.08879.pdf

gaocegege commented on July 20, 2024

Yeah, I have read the Placeto paper. I am not very familiar with GNNs, but the results are really promising.

Distributed DL training jobs have some characteristics that differ from typical DAG jobs [1]. I am not sure whether an RL + GNN (or RNN) approach yields a large improvement in this scenario, but of course, if we make any progress, I will share it here.

[1] https://arxiv.org/pdf/1901.05758.pdf
