
Comments (2)

piccolbo commented on August 22, 2024

I would like to reopen the issue touched upon in RevolutionAnalytics/RHadoop#143. It is clear that my original position, that the number of map and reduce tasks is best set on a per-cluster basis, which I based on my own experience and Cloudera recommendations, is not compatible with the variety of workloads users are creating. At one extreme we have network-I/O-heavy jobs, like a web crawler (not that many people write web crawlers in R, but for lack of a better example), for which the desired number of tasks per node may be in the hundreds (unless async I/O is used, which I don't think is available in R); in the middle we have local-I/O-heavy jobs, like a simple filter; then CPU-heavy jobs, say model fitting; and finally memory-heavy jobs, for which only a small number of tasks, maybe even just one, can run effectively on each node.

Even with the next generation of MapReduce, a job cannot make the system aware of its resource needs, with the exception of memory, so some per-job configuration will be needed for the foreseeable future. My preference is for a tasks-per-node setting rather than a total number of tasks, since it generalizes better across clusters and jobs of different sizes. For instance, a memory-intensive job where each node can only run one process needs a total number of tasks equal to the cluster size. It is difficult to query the cluster size from a streaming application; I couldn't find a clean way of doing it, although there is a list of slaves in the configuration directory. If we look at tasks per node instead, the answer is a constant 1.

There are four properties we can set to influence this behavior: mapred.map.tasks and mapred.tasktracker.map.tasks.maximum, plus the equivalent two for reduce. The "maximum" property is not job-level but tasktracker-level, so it needs a restart to take effect and clearly won't help with a mixed workload. mapred.map.tasks influences, but does not actually determine, the number of map tasks, and it says nothing about placement: if one sets mapred.map.tasks to 10, even assuming the request is honored verbatim by Hadoop, all 10 tasks may run on one node, or 1 per node on a 10-node cluster. This is not very helpful in optimizing job execution. Additional thoughts are welcome.
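A minimal sketch of passing such a per-job hint through rmr2, assuming the backend.parameters argument forwards generic -D properties to the hadoop streaming backend and that these mapred.* names match the cluster's Hadoop version (the helper name is hypothetical):

```r
library(rmr2)

# Sketch: pass mapred.map.tasks / mapred.reduce.tasks as job-level hints.
# These only influence, and do not determine, the number of tasks started,
# and they say nothing about which nodes the tasks land on.
task.hint.job <-
  function(input, map, reduce, n.map = 10, n.reduce = 4)
    mapreduce(
      input = input,
      map = map,
      reduce = reduce,
      backend.parameters = list(
        hadoop = list(
          D = sprintf("mapred.map.tasks=%d", n.map),
          D = sprintf("mapred.reduce.tasks=%d", n.reduce))))
```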

This discussion has additional information.

from rmr2.

piccolbo commented on August 22, 2024

After going through the discussion in https://issues.apache.org/jira/browse/HADOOP-5170, it seems to me that to support jobs with different CPU, memory, and I/O requirements in general (that is, a mixed load on a multi-tenant cluster), one needs to use the capacity scheduler. Setting the number of map and reduce tasks per job doesn't cut it on its own, because there is no guarantee the tasks won't all be executed on the same node. But it isn't useless either: for a job with a small input that takes a long time per unit of input, Hadoop is likely to start too few tasks, and even if one declares very low capacity demands, the desired level of parallelism will not be achieved without such a hint.
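To express that hint as tasks per node rather than an absolute count, one could scale it by an estimate of the cluster size. A rough sketch, assuming HADOOP_CONF_DIR points at the configuration directory and the slaves file there lists one worker host per line (which, as noted above, is not a clean or portable way to query cluster size):

```r
# Estimate the number of worker nodes from the slaves file; fall back to 1
# if the file is missing or empty.
cluster.size <- function(conf.dir = Sys.getenv("HADOOP_CONF_DIR")) {
  slaves <- file.path(conf.dir, "slaves")
  if (file.exists(slaves))
    max(1, length(grep("\\S", readLines(slaves))))
  else 1}

# Translate a tasks-per-node preference into the total-count hint that
# mapred.map.tasks / mapred.reduce.tasks expect, e.g.
# tasks.from.per.node(1) for the memory-bound case discussed above.
tasks.from.per.node <- function(tasks.per.node) tasks.per.node * cluster.size()
```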

from rmr2.
