Comments (6)

hongzimao commented on July 20, 2024

Thanks for your interest! Our evaluation used an earlier, private version of Alibaba's public trace (see footnote 3 in the paper). In principle, you can also test on the current version. You just need to load the job DAGs from the trace by specifying the task durations of each stage (https://github.com/hongzimao/decima-sim/blob/master/spark_env/job_generator.py#L37-L45); the real-world trace might not include these directly, but we measure the average task runtime under different degrees of parallelism (see Section 6.2, point 3 in the paper). Then follow the trace for the inter-job arrival times (TPC-H uses Poisson arrivals: https://github.com/hongzimao/decima-sim/blob/master/spark_env/job_generator.py#L129). Hope this helps!
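
In case it helps, here is a minimal sketch of how one might extract per-stage task durations from an Alibaba-style trace before feeding them to the job generator. The column names (job_id, stage_id, start_time, end_time) and the helper functions are placeholders, not the actual decima-sim API:

```python
# Hedged sketch: derive per-stage task durations from a trace CSV.
# Column names below are assumptions; adapt them to the real trace schema
# and to the loader in spark_env/job_generator.py.
import csv
from collections import defaultdict

import numpy as np


def load_stage_task_durations(trace_csv):
    """Group per-task durations (seconds) by (job_id, stage_id)."""
    durations = defaultdict(list)
    with open(trace_csv) as f:
        for row in csv.DictReader(f):
            key = (row["job_id"], row["stage_id"])
            durations[key].append(float(row["end_time"]) - float(row["start_time"]))
    return durations


def average_task_runtime(durations, job_id, stage_id):
    """Average task runtime for one stage; the paper measures this per degree
    of parallelism (Section 6.2, point 3), here it is just a plain mean."""
    return float(np.mean(durations[(job_id, stage_id)]))
```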

gaocegege commented on July 20, 2024

@hongzimao

I am trying to use the Decima framework to train a scheduler for distributed DL training jobs on Kubernetes. Thanks for your reply; it works for me.

One last question: when you evaluated the Alibaba private trace, did you use a Poisson distribution for the inter-arrival times or the real distribution from the trace? I intend to use the real distribution but am not sure whether that is possible, so I would really appreciate any advice.

Thanks.

hongzimao commented on July 20, 2024

The trace should provide the arrival time of each job. In our evaluation we replayed the trace, replacing the Poisson inter-arrival times with the actual arrival times.
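
For anyone adapting this, here is a minimal sketch of the two arrival models, assuming a hypothetical list of absolute arrival timestamps taken from the trace (this is not the decima-sim code itself):

```python
# Hedged sketch: synthetic Poisson arrivals vs. replaying trace arrivals.
import numpy as np


def poisson_interarrivals(rate_per_sec, n_jobs, seed=0):
    """Synthetic exponential inter-arrival gaps, as with the TPC-H workload."""
    rng = np.random.RandomState(seed)
    return rng.exponential(1.0 / rate_per_sec, size=n_jobs)


def replayed_interarrivals(trace_arrival_times):
    """Inter-arrival gaps taken directly from the trace's arrival timestamps."""
    return np.diff(np.sort(np.asarray(trace_arrival_times, dtype=float)))
```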

gaocegege commented on July 20, 2024

Gotcha, thanks a lot.

hongzimao commented on July 20, 2024

Optimizing distributed DL training on Kubernetes sounds very interesting! Please keep us updated on your progress. One of my colleagues recently applied a similar graph neural network technique to the TensorFlow device placement problem; the paper was just accepted to NIPS, if you are interested: https://arxiv.org/pdf/1906.08879.pdf

gaocegege commented on July 20, 2024

Yeah, I have read the Placeto paper. I am not very familiar with GNNs, but the results are really promising.

Distributed DL training jobs have some characteristics that differ from typical DAG jobs [1]. I am not sure whether an RL + GNN (or RNN) approach yields a large improvement in this scenario, but of course, if we make any progress, I will share it here.

[1] https://arxiv.org/pdf/1901.05758.pdf
