
Comments (2)

piccolbo commented on August 22, 2024

I would like to reopen the issue touched upon in RevolutionAnalytics/RHadoop#143. It is clear that my original position, that the number of map and reduce tasks is best set on a per-cluster basis, which I based on my own experience and Cloudera recommendations, is not compatible with the variety of workloads users are creating. At one extreme we have network-I/O-heavy jobs, like a web crawler (not that many people write web crawlers in R, but for lack of a better example), for which the desired number of tasks per node may be in the hundreds (unless async I/O is used, which I don't think is available in R); in the middle we have local-I/O-heavy jobs, like a simple filter; then CPU-heavy jobs, say model fitting; and finally memory-heavy jobs, for which only a small number of tasks, maybe even just one, can run effectively on each node.

Even with the next generation of MapReduce, a job cannot make the system aware of its resource needs, with the exception of memory, so some per-job configuration will be needed for the foreseeable future. My preference is for a tasks-per-node setting rather than a total number of tasks, since it generalizes better across clusters and jobs of different sizes. For instance, a memory-intensive job where each node can only run one process needs a total number of tasks equal to the cluster size. It is difficult to query the cluster size from a streaming application; I couldn't find a clean way of doing it, although there is a list of slaves in the configuration directory. If we look at tasks per node instead, the answer is a constant 1.

There are four properties we can set to influence this behavior: mapred.map.tasks and mapred.tasktracker.map.tasks.maximum, plus the equivalent two for reduce. The "maximum" property is not job-level but tasktracker-level, so it needs a restart to take effect and clearly won't help with a mixed workload. mapred.map.tasks influences, but does not actually determine, the number of map tasks, and it says nothing about placement: if one sets mapred.map.tasks to 10, even assuming the request is honored verbatim by Hadoop, all 10 tasks may run on one node, or 1 per node on a 10-node cluster. This is not very helpful in optimizing job execution. Additional thoughts are welcome.
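A minimal sketch of passing such a per-job hint through rmr2, assuming the backend.parameters argument forwards generic -D properties to the hadoop streaming backend and that these mapred.* names match the cluster's Hadoop version (the helper name is hypothetical):

```r
library(rmr2)

# Sketch: pass mapred.map.tasks / mapred.reduce.tasks as job-level hints.
# These only influence, and do not determine, the number of tasks started,
# and they say nothing about which nodes the tasks land on.
task.hint.job <-
  function(input, map, reduce, n.map = 10, n.reduce = 4)
    mapreduce(
      input = input,
      map = map,
      reduce = reduce,
      backend.parameters = list(
        hadoop = list(
          D = sprintf("mapred.map.tasks=%d", n.map),
          D = sprintf("mapred.reduce.tasks=%d", n.reduce))))
```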

This discussion has additional information.

from rmr2.

piccolbo commented on August 22, 2024

After going through the discussion in https://issues.apache.org/jira/browse/HADOOP-5170, it seems to me that to support jobs with different CPU, memory, and I/O requirements in general (that is, a mixed load on a multi-tenant cluster), one needs to use the capacity scheduler. Setting the number of map and reduce tasks per job doesn't cut it on its own, because there is no guarantee the tasks won't all be executed on the same node. But it isn't useless either: for a job with a small input that takes a long time per unit of input, Hadoop is likely to start too few tasks, and even if one declares very low capacity demands, the desired level of parallelism will not be achieved without such a hint.
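To express that hint as tasks per node rather than an absolute count, one could scale it by an estimate of the cluster size. A rough sketch, assuming HADOOP_CONF_DIR points at the configuration directory and the slaves file there lists one worker host per line (which, as noted above, is not a clean or portable way to query cluster size):

```r
# Estimate the number of worker nodes from the slaves file; fall back to 1
# if the file is missing or empty.
cluster.size <- function(conf.dir = Sys.getenv("HADOOP_CONF_DIR")) {
  slaves <- file.path(conf.dir, "slaves")
  if (file.exists(slaves))
    max(1, length(grep("\\S", readLines(slaves))))
  else 1}

# Translate a tasks-per-node preference into the total-count hint that
# mapred.map.tasks / mapred.reduce.tasks expect, e.g.
# tasks.from.per.node(1) for the memory-bound case discussed above.
tasks.from.per.node <- function(tasks.per.node) tasks.per.node * cluster.size()
```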

from rmr2.
