decima-sim's Issues

What about cross-server data transmission overhead?

Sorry to bother you again.

In my research area, each stage is scheduled onto some VM node. If a stage's child stages are placed on different VM nodes, cross-node data transmission overhead should be considered. Thus, minimizing the makespan can be divided into two subgoals: the execution time and the cross-node communication overhead.

But I found that Decima does not consider the transmission time of intermediate data between successive stages of each job. Is this because the scheduling environment is Spark, or because all the jobs run on the same "VM node"?

Questions about the input vector

Hi, after reading the paper and the code, I have some questions about the input vector:
1. I noticed that the number of executors assigned to a node's DAG is included in the input vector. Could I ask why Decima takes this as one of the features of a node?
2. The action consists of two parts: selecting a node and selecting the number of executors. I noticed that Decima calculates each job's upper limit. Would it be feasible to change the second part of the action to calculate each node's upper limit instead? Could I ask why Decima chooses to calculate each job's upper limit?
Really looking forward to your reply, thank you very much :)

A question about the loss function

Hi,
There are two loss terms in the actor agent: the advantage loss and the entropy loss. Can you tell me why you add the entropy loss? I know the entropy weight decays from 1 to 0.0001, but I do not understand why the entropy loss is needed.

thank you!
Liu
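
For readers with the same question, a general note (standard policy-gradient practice, not a statement about Decima's exact code): an entropy bonus keeps the action distribution from collapsing to a deterministic policy too early, so the agent keeps exploring; the weight is then annealed toward zero as training converges. A minimal NumPy sketch of the quantity involved:

import numpy as np

def entropy(probs, eps=1e-6):
    # Shannon entropy of a discrete action distribution
    return -np.sum(probs * np.log(probs + eps))

print(entropy(np.array([0.25, 0.25, 0.25, 0.25])))  # ~1.386: maximal, exploratory
print(entropy(np.array([0.97, 0.01, 0.01, 0.01])))  # ~0.17: nearly deterministic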

Updating TensorFlow 1.14 to 2

Hi, Mao.

Is there a specific reason why you didn't update Decima to work with TensorFlow 2, despite its use of tf.contrib...?

What is L in Figure 6?

What is L in Fig. 6, and how do you ensure that the predicted degree of parallelism is greater than the number of executors already assigned to the job?

What are the versions of the project pkg requirement?

I get some warnings when running your code with TensorFlow 1.14 on Windows, such as:

WARNING:tensorflow:From D:\DeepRL\decima-sim-master\actor_agent.py:19: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.

WARNING:tensorflow:From test.py:20: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

WARNING:tensorflow:From test.py:30: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.
2021-10-10 23:22:55.372925: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
WARNING:tensorflow:From D:\DeepRL\decima-sim-master\actor_agent.py:39: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
WARNING:tensorflow:From D:\DeepRL\decima-sim-master\gcn.py:30: The name tf.sparse_placeholder is deprecated. Please use tf.compat.v1.sparse_placeholder instead.
WARNING:tensorflow:From D:\DeepRL\decima-sim-master\tf_op.py:34: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.
WARNING:tensorflow:Entity <bound method Dense.call of <tensorflow.python.layers.core.Dense object at 0x0000014DF8E30C08>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output. Cause: converting <bound method Dense.call of <tensorflow.python.layers.core.Dense object at 0x0000014DF8E30C08>>: AttributeError: module 'gast' has no attribute 'Index'
(the same AutoGraph warning repeats for seven more Dense layers; the repetitions are omitted here)
WARNING:tensorflow:From D:\DeepRL\decima-sim-master\actor_agent.py:225: calling softmax (from tensorflow.python.ops.nn_ops) with dim is deprecated and will be removed in a future version.
Instructions for updating:
dim is deprecated, use axis instead
WARNING:tensorflow:From D:\DeepRL\decima-sim-master\actor_agent.py:96: calling reduce_sum_v1 (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
WARNING:tensorflow:From D:\ProgramData\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\ops\math_grad.py:1250: add_dispatch_support..wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From D:\DeepRL\decima-sim-master\actor_agent.py:159: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

WARNING:tensorflow:From D:\ProgramData\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.

So what are the package requirements for this code?
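
A general note for readers hitting the same wall: these are deprecation warnings only, and the code still runs under TensorFlow 1.x. If the noise is a problem, the standard TF 1.x logging API (not something from the Decima repo) can silence it before the graph is built:

import tensorflow as tf

# Hide TF 1.x deprecation chatter; errors are still shown.
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)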

Questions regarding actor_network

Hello, after reading the code I have some questions about the details of the actor_network. If these questions can be answered, I will be very grateful :)

  1. In the node network part, why does the input to the network have three dimensions instead of the usual two-dimensional [batch_size, number of features] design? What are the considerations behind this input shape?
  2. In the node network part, the input features combine the output of the GCN with "node_inputs". Why is "node_inputs" merged in as well? Isn't the former already a high-level representation of the latter produced by the GCN? (See the sketch below.)
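
A general observation that may help with question 2, as a sketch with hypothetical shapes (not the repo's actual tensors): concatenating the raw features with the GNN embedding is a common skip-connection-style design, letting the downstream layers see both the low-level statistics and the aggregated neighborhood summary. The third dimension arises because each sample carries a variable set of nodes, giving [batch_size, num_nodes, num_features]:

import numpy as np

node_inputs = np.random.rand(1, 20, 5)   # [batch, num_nodes, raw features]
gcn_outputs = np.random.rand(1, 20, 8)   # [batch, num_nodes, embedding]

# keep the raw features alongside the learned embedding
merged = np.concatenate([node_inputs, gcn_outputs], axis=-1)
print(merged.shape)  # (1, 20, 13)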

About the environment

Would you please list the environment?

For example, the operating system and the package requirements.

Do I need to activate Spark before running?

A question about the npy files used in spark_env

Question:
There are some .npy files in your code, including "task_duration" and "stage_id_to_node_idx_map". Could you please tell me how to generate these files? I'm trying to reuse your code, but I don't know what these files mean.

Thank you!

Some questions about executor

1. What does "source executors" mean?
2. What does "source job"/"source" mean?
3. What does "use_exec" mean?
4. What does "exec_commit" mean?
5. Why is use_exec calculated in this way?

Some questions about Decima's GNN

I recently studied this code (gcn-pytorch). Is this part mainly implementing Equation (1) in the Decima paper?

According to the Decima paper, message passing goes from the leaves to the root, and the DAG summary is then computed over all the nodes in a job. The GCNConv layer in PyTorch Geometric seems to perform only one round of message passing from neighbor nodes, not from leaves to root. I don't know whether that is a major part of the message passing scheme you designed.

Another thing has been puzzling me: what is the significance of 'num_steps' in gcn-pytorch? Is it the same as 'max_depth' in the TensorFlow code? Does it represent the maximum number of message passing steps from the leaves to the root, and how is it controlled during training?

I look forward to your reply. Thank you
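
For readers puzzling over the same point, a schematic NumPy sketch (not the repo's code) of why one GCN-style layer is not enough: each aggregation step only pulls information from direct children, so propagating from the leaves all the way to the root takes as many steps as the DAG is deep, which is what a num_steps / max_depth style parameter bounds:

import numpy as np

# adj[i, j] = 1 means node j is a child of node i (a chain: 2 <- 1 <- 0)
adj = np.array([[0., 0., 0.],
                [1., 0., 0.],
                [0., 1., 0.]])
h = np.eye(3)  # one-hot initial node features

for step in range(2):  # chain depth from leaf to root is 2
    h = h + adj @ h    # each node adds messages from its children

print(h[2])  # the root (node 2) now carries information from the leaf (node 0)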

A question about actor_network

When I use a large dataset to run Decima, I get the error "BiasGrad requires tensor size <= int32 max", and I wonder if you have met this error. As the dataset grows, so does the size of the neural network. Do you have any idea how to solve this error?

Question about .npy files

May I ask:
(1) On what basis do you generate the .npy files of adj_mat and task_duration, and how are they generated?
(2) In the multi-resource environment, does the adj_mat.npy file correspond to actual Alibaba trace data?

Thank you!

The model training issue with reward function optimizing makespan

Hi, Hongzi

I noticed your code supports a makespan-optimized policy by setting args.learn_obj to 'makespan'. However, when trained with the recommended small-scale setting (200 stream jobs on 8 agents) for 3000 episodes, the model doesn't seem to converge as it normally does with the average-JCT objective. The figures below show the actor_loss and average_reward_per_second collected during training. The average_reward_per_second is always around -1, because the reward equals the negative makespan (i.e., the total time it is divided by). Could you suggest a setting that may be missing to guarantee convergence?
[figures: avg_reward_per_sec and actor_loss during training]

Some questions about the main idea

Hi, Hongzi
I am new to artificial intelligence algorithms, and my daily work focuses on using Spark. After reading the paper and the implementation code, I had a question about the dataset. Each task in the dataset has a task duration. The purpose of feeding part of the job DAGs to train.py is to train an agent responsible for scheduling, and the purpose of feeding part of the job DAGs to test.py is then to check whether the agent's scheduling achieves a shortened JCT. Can I understand it this way?

However, in an actual production environment we would not have the durations of these tasks. If I were to use your approach in production, could the task durations be obtained by a predictive algorithm?

I look forward to your reply. Thank you so much!

Question about the adv_loss

Hello, I have a question about the adv_loss.
In your paper, the advantage term in the network parameter update formula carries the normal (positive) sign.
[image: parameter update formula from the paper]

But why is self.adv negated when calculating self.adv_loss in the code (in actor_agent.py)? In addition, what does "negated" mean in the code comment "# actor loss due to advantage (negated)"?

# actor loss due to advantage (negated)
self.adv_loss = tf.reduce_sum(tf.multiply(
            tf.log(self.selected_node_prob * self.selected_job_prob + \
                   self.eps), -self.adv))

Thank you!
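
For readers with the same question, the usual reason in policy-gradient code (a standard convention, not an official answer from the authors): TensorFlow optimizers minimize a loss, while REINFORCE maximizes the expected return, so the objective is negated before being handed to the optimizer:

$\mathcal{L}_{adv} = -\sum_{t} \log \pi_\theta(a_t \mid s_t)\, A_t$

Minimizing this loss performs gradient ascent on the advantage-weighted log-probabilities, which matches the positive-sign update formula in the paper.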

Questions regarding the paper

1/ I want to ask about the inputs of your code: what exactly are they, and where are they in the code?

2/ Can I change the inputs to see different results, and how?

A question about the result

Hi,
I noticed that the duration of a task is decided by code in node.py, which uses np.random.randint to generate the task's running time. But even if I replace it with np_random with a specified seed, the result I get is still different each time I train the model. I have no idea why this happens.
Thank you!
Hu
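
A hedged note for readers chasing the same reproducibility issue (general TF 1.x practice, not a diagnosis of this repo): the policy also samples inside the TensorFlow graph (e.g. the tf.random_uniform noise used in action selection), so seeding NumPy alone is not enough; the graph-level seed must be fixed too:

import numpy as np
import tensorflow as tf

np.random.seed(42)                # fixes np.random / np_random draws
tf.compat.v1.set_random_seed(42)  # fixes graph-level ops like tf.random_uniform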

Past workload logs

Hi!

Just had a doubt about what you meant by "Decima uses existing monitoring information and past workload logs to automatically learn sophisticated scheduling policies". By "past workload logs", you're referring to the fact that the agent learns as it receives rewards, right?

Train issues

Hi, I'm trying to run your code on 4 Tesla P100 GPUs (the same hardware as mentioned in your paper), but the time per epoch is 22 s, and after 200 epochs the GPUs run out of memory. Any idea why this happens? Is there any configuration that would make it run faster?
Thanks,
Idan

About nodes information

Hello. Thank you for sharing the interesting DECIMA model.

How can I get the feature vector of a node in a job DAG? In the paper it consists of:
(i) the number of tasks remaining in the stage;
(ii) the average task duration;
(iii) the number of executors currently working on the node;
(iv) the number of available executors;
(v) whether available executors are local to the job.
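
A hypothetical sketch of assembling the five per-node features listed above (the names are illustrative, not the repo's actual variables):

import numpy as np

def node_feature_vector(num_remaining_tasks, avg_task_duration,
                        num_working_executors, num_available_executors,
                        executors_local_to_job):
    return np.array([
        num_remaining_tasks,            # (i)
        avg_task_duration,              # (ii)
        num_working_executors,          # (iii)
        num_available_executors,        # (iv)
        float(executors_local_to_job),  # (v) 1.0 if local, else 0.0
    ])

print(node_feature_vector(200, 0.30, 4, 10, True))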

Training process is too slow

Hi, I ran your code with the example in the README, but every epoch costs 30+ seconds running on the CPU. When I use a GPU to run the same code, it costs 160+ seconds! In your paper, every epoch costs roughly 1.5 seconds. How can I reduce the training time?
My environment is as follows:
Intel 8180M 2.5 GHz (112 cores)
Tesla V100 x 2
Ubuntu 18.04 + NVIDIA driver 410.48 + CUDA 10.0.130 + libcudnn 7.6.5 + TensorFlow 1.15
I set allow_growth=True in tf.config to avoid OOM errors.
I want to reproduce your experiments. How can I accelerate the process, other than reducing the size of the workload?
Thank you!

Bug in determining `done` in `env.step`?

decima-sim/spark_env/env.py

Lines 338 to 341 in c010dd7

# no more decision to make, jobs all done or time is up
done = (self.num_source_exec == 0) and \
       ((len(self.timeline) == 0) or \
        (self.wall_time.curr_time >= self.max_time))

I believe this logic is wrong, and the intended logic is as follows:

done = (self.num_source_exec == 0 and len(self.timeline) == 0) or \
       self.wall_time.curr_time >= self.max_time

The way it's currently written allows the environment to keep going way past the max time.
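
A quick check in plain Python (hypothetical values) of the case the report describes: time is up, but executors are still mid-commitment, so num_source_exec != 0:

num_source_exec, timeline_len, curr_time, max_time = 3, 5, 100.0, 50.0

original = (num_source_exec == 0) and \
           ((timeline_len == 0) or (curr_time >= max_time))
proposed = (num_source_exec == 0 and timeline_len == 0) or \
           (curr_time >= max_time)

print(original, proposed)  # False True: the original keeps running past max_time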

Is scheduling taken as a Markovian Process which allows Reinforcement Learning to be used?

Decima uses a reinforcement learning framework. In reinforcement learning, as mentioned in the appendix, 'The state transitions and rewards are stochastic and assumed to be a Markov process'. However, the introduction states, 'Decima uses existing monitoring information and past workload logs to automatically learn sophisticated scheduling policies'.

I had two questions: is scheduling treated as a Markovian process, which allows reinforcement learning to be used? And does the neural-network-based design have something to do with it?

Training and Generalization

As mentioned, Decima is trained by capping the number of incoming jobs at 2000. Appendix I describes generalizing the neural network model, where a scaled-down version of the workload is used.

Given the neural network architecture, when Decima is trained with the number of jobs capped at 2000, would the exact same trained network be used for deployment at test time? How does this scale to a larger number of jobs, i.e., what would the network structure look like if the number of test jobs were larger?

Questions about the node_acts & job_acts selection.

logits = tf.log(self.node_act_probs)
noise = tf.random_uniform(tf.shape(logits))
self.node_acts = tf.argmax(logits - tf.log(-tf.log(noise)), 1)
# job_acts [batch_size, num_jobs, 1]
logits = tf.log(self.job_act_probs)
noise = tf.random_uniform(tf.shape(logits))
self.job_acts = tf.argmax(logits - tf.log(-tf.log(noise)), 2)

In my opinion, self.node_act_probs in your code represents the probability of selecting each node, and the noise is then used to explore the action space of the reinforcement learning problem.

However, after working through the code, I get $\text{node\_acts} = \arg\max\left(\frac{p}{-\log(noise)}\right)$, where $p \in [0, 1]$ is the probability of selecting each node and $noise \in (0, 1)$ follows a uniform distribution. But since $-\log(noise) \in (0, +\infty)$, it seems that after this calculation the choice of the next node might be completely random.

Maybe my understanding is wrong, but how should I understand this implementation?

By the way, have you considered using an $\epsilon$-greedy policy to explore the action space, and why did you not select this policy?
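
For readers with the same doubt: the expression in the code matches the standard Gumbel-max trick, in which $\arg\max(\log p - \log(-\log u))$ with $u \sim \mathrm{Uniform}(0,1)$ draws an exact sample from the categorical distribution $p$. The choice is stochastic for exploration, but weighted by $p$, not completely random. A minimal NumPy check:

import numpy as np

rng = np.random.RandomState(0)
p = np.array([0.1, 0.2, 0.7])  # probability of selecting each node

counts = np.zeros_like(p)
for _ in range(100000):
    u = rng.uniform(size=p.shape)                  # noise, as in tf.random_uniform
    a = np.argmax(np.log(p) - np.log(-np.log(u)))  # Gumbel-max sample
    counts[a] += 1

print(counts / counts.sum())  # approaches p: roughly [0.1, 0.2, 0.7]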

Glorot method

Hi!
I'd like to know why you use output_dims as the fan-in in the f's and g's (in the Glorot initialization). I mean, why not use hid_dims as the fan-in and output_dims as the fan-out?

How to calculate the Job Completion Time?

JCT means the time from when the computer cluster receives a job to when the job finishes. Each red line in the figure corresponds to the completion time of a job.
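
A minimal sketch of the definition as stated above (hypothetical field names, not the repo's API):

def job_completion_time(arrival_time, finish_time):
    # JCT = when the job finishes - when the cluster received it
    return finish_time - arrival_time

print(job_completion_time(arrival_time=10.0, finish_time=42.5))  # 32.5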

About aggregation issues in the messaging passing

def get_bottom_up_paths(job_dag):

The msg_mats and msg_masks generated in this function will, in some cases, cause a node to aggregate both its child node's original feature and its embedded feature.

For instance, consider the job DAG below:

[figure: example job DAG with nodes including d, e, and f]

and its msg_mats:

[figure: the generated msg_mats]

According to the msg_masks and msg_mats generated above,

In the first step of message passing, node e passes its original feature to node d, so d aggregates the original feature of e; meanwhile, node f passes its original feature to produce the embedded feature of e.

In the second step of message passing, node e passes its embedded feature to node d, so node d aggregates the embedded feature of e.

So in the end, node d aggregates not only the embedded feature of e but also the original feature of e.

But according to my understanding of the paper, node d should aggregate only the embedded feature of its child e.

Did I get it wrong? I really hope to get your reply, thank you!

Questions and answers regrading codes in spark_env/job_dag.py

Here are some questions I ran into when reading the code in /decima-sim/spark_env/. I asked Hongzi and he replied clearly and concisely. It's an honor to post Hongzi's answers here, so that anyone with the same questions can get some inspiration from them.



There are some questions I can't quite figure out about the fundamental assumption of Spark, namely that every job ends with a single final stage.

reply from Hongzi

It's just that every Spark job (at least the ones we studied) ends with a single leaf node. You can see this in the DAG visualizations you found in spark_env/tpch/dag_visualization.


first question

My first question is about the code

def merge_job_dags(job_dags):
    # merge all DAGs into a general big DAG
    # this function will modify the original data structure
    # 1. take nodes from the natural order
    # 2. wire the parent and children across DAGs
    # 3. reconstruct adj_mat by properly connecting
    # the new edges among individual adj_mats

    total_num_nodes = sum([d.num_nodes for d in job_dags])
    nodes = []
    adj_mat = np.zeros([total_num_nodes, total_num_nodes])

    base = 0  # for figuring out new node index
    leaf_nodes = []  # leaf nodes in the current job_dag

    for job_dag in job_dags:

        num_nodes = job_dag.num_nodes

        for n in job_dag.nodes:
            n.idx += base
            nodes.append(n)

        # update the adj matrix
        adj_mat[base : base + num_nodes, \
            base : base + num_nodes] = job_dag.adj_mat

        # fundamental assumption of spark --
        # every job ends with a single final stage
        if base != 0:  # at least second job
            for i in range(num_nodes):
                if np.sum(job_dag.adj_mat[:, i]) == 0:
                    assert len(job_dag.nodes[i].parent_nodes) == 0
                    adj_mat[base - 1, base + i] = 1

        # store a set of new root nodes
        root_nodes = []
        for n in job_dag.nodes:
            if len(n.parent_nodes) == 0:
                root_nodes.append(n)

        # connect the root nodes with leaf nodes
        for root_node in root_nodes:
            for leaf_node in leaf_nodes:
                leaf_node.child_nodes.append(root_node)
                root_node.parent_nodes.append(leaf_node)

        # store a set of new leaf nodes
        leaf_nodes = []
        for n in job_dag.nodes:
            if len(n.child_nodes) == 0:
                leaf_nodes.append(n)

        # update base
        base += num_nodes

    assert len(nodes) == adj_mat.shape[0]

    merged_job_dag = JobDAG(nodes, adj_mat)

    return merged_job_dag
The part I'm asking about is this snippet:

        # fundamental assumption of spark --
        # every job ends with a single final stage
        if base != 0:  # at least second job
            for i in range(num_nodes):
                if np.sum(job_dag.adj_mat[:, i]) == 0:
                    assert len(job_dag.nodes[i].parent_nodes) == 0
                    adj_mat[base - 1, base + i] = 1

I guess base - 1 here is because every job_dag ends with its last-index node? So base - 1 just means the final stage of the previous job_dag?

        # store a set of new leaf nodes
        leaf_nodes = []
        for n in job_dag.nodes:
            if len(n.child_nodes) == 0:
                leaf_nodes.append(n)

Following my previous guess and the assumption that every job ends with a single final stage, I think there is only one leaf node, namely base + job_dag.num_nodes - 1.

reply from Hongzi

I think your understanding is correct. This function connects multiple DAGs sequentially to make them into one bigger DAG. However, I don't think this function is used anywhere else in the repo. At some point we wanted to study Decima's behavior when scheduling a single DAG, and we used functions similar to this one to create big DAGs.

There was still some confusion, so I asked Hongzi again:

I didn't express myself clearly last time. What I wanted to say is that, in my understanding, there are not multiple leaf nodes but only one leaf node: base + job_dag.num_nodes - 1. So I'm confused about the leaf_nodes setup.

reply from Hongzi

Hmm, I don't think we explicitly put the leaf node at the end of the list of nodes. We just pick the leaf nodes based on the definition: a leaf node doesn't have any child node (len(n.child_nodes) == 0).


second question

My second question is about the pictures in decima-sim/spark_env/tpch/dag_visualization and decima-sim/spark_env/tpch/task_durations.

[screenshot: DAG visualization with node labels]

I understand that the 0 1 2 3 are the node indices, but I'm not sure whether the 200 means 200 tasks, and whether 0.30 sec is the duration of a single task?

reply from Hongzi

Your assumption is correct. Here's the code snippet that generates the text on each node:

        # a *very* rough average of task duration
        for n in task_durations:
            task_duration = task_durations[n]

            e = next(iter(task_duration['first_wave']))
            num_tasks = len(task_duration['first_wave'][e]) + \
                        len(task_duration['rest_wave'][e])

            avg_task_duration = np.mean(
                [i for l in task_duration['first_wave'].values() for i in l] + \
                [i for l in task_duration['rest_wave'].values() for i in l] + \
                [i for l in task_duration['fresh_durations'].values() for i in l])

            nodes_mat[n, 0] = num_tasks
            nodes_mat[n, 1] = '{0:.2f}'.format(avg_task_duration / 1000.0) + 'sec'

[screenshot: task duration plot with colored markers]

And I also don’t understand what the different colors mean here.

reply from Hongzi

The colors mean different "waves" of task durations under different numbers of executors. More details on the wave behavior can be found in the Decima paper, Section 3 and Section 6.2(1). The code snippet for the visualization is:

    for exec_num in sorted(duration_info.keys()):

        fw = duration_info[exec_num]['first_wave']
        rw = duration_info[exec_num]['rest_wave']
        fd = duration_info[exec_num]['fresh_durations']

        plt.plot([plt_idx] * len(fw), fw, 'rx')  # red: first-wave durations
        plt.plot([plt_idx] * len(rw), rw, 'bx')  # blue: rest-wave durations
        plt.plot([plt_idx] * len(fd), fd, 'gx')  # green: fresh durations

        plt_idx += 1

Code for Learning State Representation

Please pardon my naivete, but I was unable to pinpoint the code that learns the state representations. More specifically, I am interested in reviewing how you produce the job embeddings. The paper says very little about this.

Question and answer regarding Decima paper

Question:

According to the appendix of the paper, you used supervised learning to train the graph neural networks as a sanity check. I presume the target (label) is the critical path, which is computed in the JobDAGDuration class in job_dag.py. When training the code, this class is ignored.

• So, do the GNNs (GCN & GSN) follow an unsupervised learning scheme?
• In that sense, the GNNs act as preprocessing to capture local/global summaries of the loaded jobs, and I believe the code runs on a fixed number of input jobs. Is there any way to handle varying numbers of incoming jobs?

======================================================================
Hongzi's answer:

The appendix experiment is just to make sure the GNN architecture at least has the power to express existing heuristics that use the critical path. In the main paper, the Decima scheduling agent is trained end-to-end with reinforcement learning. This includes the weights of the GNN (since the entire neural network in Figure 6 is trained together). Therefore, as expected, the main training code won't invoke the critical path module during training.

Also, Decima's GNN handles a variable number of jobs by design. Please note that the default training in our code uses streaming jobs (jobs keep arriving in the system) with the flag --num_stream_dags 200. Section 5.1 of our paper explains in detail why this design scales to arbitrary DAG shapes and sizes.

A question about multi-resource

I can't find these folders: multi_resource_agents and multi_resource_env. How can I run multi_resource_test.py?

How to integrate Decima in Spark

Hello @hongzimao ,
Thank you for sharing this awesome work. I think this code covers only the simulator part. Could you give me some instructions or code on how to integrate it with a real cluster?

PyTorch Implementation of Decima Available!

Shameless plug incoming...
For anyone interested in a PyTorch implementation of Decima, I implemented one for my Master's research project; it can be found here. I also made some enhancements to the RL algorithm and the GNN model, and added a simple renderer. I am happy to answer questions about Decima, to the best of my understanding, as I spent a lot of time precisely recreating it. Feel free to open issues on my repo or email me at [email protected].

A hidden bug in the function named update_frontier_nodes

I think there is a bug in the following function, defined in spark_env/job_dag.py:

def update_frontier_nodes(self, node):
    frontier_nodes_changed = False
    for child in node.child_nodes:
        if child.is_schedulable():
            if child.idx not in self.frontier_nodes:   # bug is here
                self.frontier_nodes.add(child)
                frontier_nodes_changed = True
    return frontier_nodes_changed

What self.frontier_nodes stores are the nodes themselves, not their indices, so this membership test always succeeds. That said, it did not have a significant effect on the training results.
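
A sketch of the corrected method as implied by the report above (membership is tested on the node object itself, since self.frontier_nodes stores nodes, not indices):

def update_frontier_nodes(self, node):
    frontier_nodes_changed = False
    for child in node.child_nodes:
        if child.is_schedulable():
            if child not in self.frontier_nodes:  # fixed: compare nodes, not indices
                self.frontier_nodes.add(child)
                frontier_nodes_changed = True
    return frontier_nodes_changed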

About the file “env.py”

  1. Does "self.source_job" means the job that you decide to schedule in this step?
  2. In "get_executor_limits" this function, will all the jobs' limit except souce job be 0 ?
  3. What does "self.exec_commit" mean?

Thank you!

Question about the NN structure of Decima and its relation with the problem size

Questions:

  1. About the input space size and the number of entries of the softmax for selecting a stage/node: do you make an assumption about the maximum number of concurrently runnable nodes/stages in the system? From the paper (Figure 6), it seems that this value (n) needs to be predefined and the softmax should have the same number of input/output entries. Is that correct?

  2. It seems the softmax for selecting the maximum parallelism also needs its number of input/output entries fixed beforehand. You state in the paper that "the number of possible limits can be as large as the number of executors", so if we want to apply your solution to a new, larger cluster, the number of entries of that softmax should be increased proportionally to the new cluster size. Is my understanding correct?


Answers from Hongzi:

  1. No, we don't restrict the total number of nodes. Note that the softmax operation is scale-free: the input to softmax can have arbitrary size (it's just exponentials normalized by their sum). Check out the softmax function in TensorFlow or PyTorch and how it applies to the input vector.

  2. I think your understanding is correct. We were being lazy and used one output node per parallelism limit, so you need n output nodes if you have n executors. A more scalable way is to output the parallelism limit directly as a number. You can express such a continuous number (rounded afterwards) with a Gaussian distribution: the neural network outputs the mean, and you sample from the Gaussian (similar to how you sample from the softmax output).

Thanks!
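
A minimal NumPy sketch of the alternative Hongzi describes in answer 2 (hypothetical function, not from the repo): the network outputs the mean of a Gaussian over the parallelism limit; a sample is drawn, rounded, and clipped to a valid range:

import numpy as np

def sample_parallelism_limit(mean, std, num_executors, rng=np.random):
    raw = rng.normal(loc=mean, scale=std)      # continuous sample around the mean
    limit = int(round(raw))                    # round to an integer limit
    return min(max(limit, 1), num_executors)   # clip to [1, num_executors]

print(sample_parallelism_limit(mean=12.3, std=2.0, num_executors=50))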

State information and features

Hi,
In Section 5.1 you describe how Decima converts the state information into features. Are the features always those shown in the image?
[image: node attribute features]
