
Comments (17)

aswinjoseroy avatar aswinjoseroy commented on May 22, 2024 3

I am getting this on Spark 1.6.0 (tested on standalone and a yarn cluster).

from sparkling-water.

nftw avatar nftw commented on May 22, 2024

I get a similar issue if I use Spark 1.2.1; it works with Spark 1.1.1.

val h2oContext = new H2OContext(sc).start()
java.lang.IllegalArgumentException: Cannot execute H2O on all Spark executors:
  numH2OWorkers = -1
  executorStatus = (0,false),(1,false),(2,false), ... [the same (executorId,false) pairs repeated for every attempt; output truncated]
    at org.apache.spark.h2o.H2OContext.start(H2OContext.scala:112)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:18)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:23)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:25)
    at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:27)
    at $iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
    at $iwC$$iwC$$iwC.<init>(<console>:31)
    at $iwC$$iwC.<init>(<console>:33)
    at $iwC.<init>(<console>:35)
    at <init>(<console>:37)
    at .<init>(<console>:41)
    at .<clinit>(<console>)
    at .<init>(<console>:7)
    at .<clinit>(<console>)
    at $print(<console>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852)
    at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125)
    at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669)
    at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:828)
    at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:873)
    at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:785)
    at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:628)
    at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:636)
    at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:641)
    at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:968)
    at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916)
    at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916)
    at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
    at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:916)
    at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1011)
    at org.apache.spark.repl.Main$.main(Main.scala:31)
    at org.apache.spark.repl.Main.main(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


nftw avatar nftw commented on May 22, 2024

Spark 1.2.1 works fine with latest commit (265c1be). I was using the 0.2.10-81 release previously.


mmalohlava avatar mmalohlava commented on May 22, 2024

Perfect, thanks for trying!

Let us know how Sparkling Water works for you and whether you are missing anything!
Thank you!
michal

On 3/12/15 8:21 AM, nftw wrote:

Spark 1.2.1 works fine with latest commit (265c1be). I was using the 0.2.10-81 release previously.




lev112 avatar lev112 commented on May 22, 2024

Hi,
I'm working with spark 1.4.0 and sparkling water 1.4.3
and getting the same error:

ApplicationMaster: User class threw exception: java.lang.IllegalArgumentException: Cannot execute H2O on all Spark executors:
 Expected number of H2O workers is 12
 Detected number of Spark workers is 11
 Num of Spark executors before is 11
 Num of Spark executors after is 11

I'm running on a YARN cluster, and I suspect that because the cluster is busy, I'm getting only some of the executors I requested.
Looking at the H2OContext code, I see that the number of H2O workers must equal the number of Spark executors.

Maybe it would be better to have spark.ext.h2o.cluster.size indicate the minimal size of the cluster, to handle this situation?

I can try to do a PR if you think it is a good idea.
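For reference, the property mentioned above can already be passed at submit time. A sketch only: the jar name is illustrative, and whether spark.ext.h2o.cluster.size acts as a fixed or a minimal size is exactly what is being proposed here:

```shell
# Illustrative spark-submit invocation. spark.ext.h2o.cluster.size is the
# property discussed above; today it is matched against the exact number
# of Spark executors, which is why a busy YARN queue can fail the check.
spark-submit \
  --master yarn \
  --num-executors 12 \
  --conf spark.ext.h2o.cluster.size=12 \
  my-sparkling-water-app.jar
```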


mmalohlava avatar mmalohlava commented on May 22, 2024

Hi,

you are right!
Sometimes the YARN cluster is too busy to provide all executors at once.
Sparkling Water waits a little, but if we cannot see all executors within a limited amount of time, we currently give up.

You can try to modify H2OContext; however, the main problem is that we co-locate H2O data with Spark data. For example, if an RDD is created and one of its partitions lands on a node which H2O did not see during launch, we will not be able to create an H2O chunk (~partition) there. (A simple solution for this situation is to download the data from that node to an existing H2O worker.)

Does it make sense?


lev112 avatar lev112 commented on May 22, 2024

It does,
thanks!

Since I don't think it is possible to tell Spark to delay execution until all of the requested resources have been received,
I will just configure a high number of retries.
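A sketch of what "a high number of retries" can look like at submit time. The property name spark.ext.h2o.spreadrdd.retries is an assumption based on the internal backend's executor-discovery loop, so verify it against the H2OConf of your Sparkling Water version:

```shell
# Assumed property name (check your Sparkling Water release); raises the
# number of attempts made to discover all executors before giving up.
spark-submit \
  --conf spark.ext.h2o.spreadrdd.retries=100 \
  my-app.jar
```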


mmalohlava avatar mmalohlava commented on May 22, 2024

You are right - we provided a PR for Spark to introduce hooks into the lifecycle of executors, but it was not accepted. So right now, we do a number of retries to figure out the number of executors.


idanz avatar idanz commented on May 22, 2024

On a similar note, I'd like to ask what will happen to H2O if executors die (because of dynamic allocation or YARN preemption) - will the application fail?


mmalohlava avatar mmalohlava commented on May 22, 2024

On 9/25/15 1:27 AM, Idan Zalzberg wrote:

In a similar note, I'd like to ask what will happen to H2O if executors die (because of dynamic
allocation or yarn pre-emption) - will the application fail?

Right now, the H2O cluster will go down and you have to repeat the computation (e.g., creation of a model).

michal
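Since losing an executor takes the whole H2O cluster down, one common mitigation with this backend is to pin the executor set using standard Spark properties. A minimal sketch; note that YARN preemption can still kill executors, but Spark will not release them voluntarily:

```shell
# Standard Spark properties: disable dynamic allocation and request a
# fixed number of executors for the lifetime of the application.
spark-submit \
  --conf spark.dynamicAllocation.enabled=false \
  --num-executors 11 \
  my-app.jar
```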




jakubhava avatar jakubhava commented on May 22, 2024

This is a known technical problem; however, we created a new Sparkling Water backend to solve this issue. Please refer to the External backend documentation (https://github.com/h2oai/sparkling-water/blob/master/doc/backends.md) for more information.
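Switching backends is a configuration change. A hedged sketch: the property names are as I understand them from the linked backends.md, and my-h2o-cloud is a placeholder, so verify both against your version:

```shell
# spark.ext.h2o.backend.cluster.mode accepts "internal" (the default) or
# "external"; in external mode the H2O cluster is started separately and
# is therefore unaffected by Spark executors coming and going.
spark-submit \
  --conf spark.ext.h2o.backend.cluster.mode=external \
  --conf spark.ext.h2o.cloud.name=my-h2o-cloud \
  my-app.jar
```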


NkululekoThangelane avatar NkululekoThangelane commented on May 22, 2024

Hi @jakubhava,

Was this issue resolved? I am not sure what the problem is, and I am still getting it with Sparkling Water.
When trying to convert a Spark DataFrame to an H2OFrame I get the following:

Py4JJavaError: An error occurred while calling o245.asH2OFrame.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 192 in stage 418.0 failed 4 times, most recent failure: Lost task 192.3 in stage 418.0 : java.lang.ArrayIndexOutOfBoundsException: 65535


jakubhava avatar jakubhava commented on May 22, 2024

Hi @NkululekoThangelane,
without further information, I would suspect that in your case one Spark executor died. Have you started H2OContext? Can you please share your code?


NkululekoThangelane avatar NkululekoThangelane commented on May 22, 2024

Hi @jakubhava,
I restarted my Spark context and H2OContext and the problem simply went away.



JeremyLG avatar JeremyLG commented on May 22, 2024

@NkululekoThangelane This issue is still around.

Sometimes when I run my jar on a YARN cluster, it gives me this error. I'm running on HDP 2.6.4 with Spark 2.2 and Sparkling Water 2.2.6

I still don't know how to reproduce it.

Is there a Spark config option, or a way in my Scala code, to make the jar robust to this kind of error? For example, waiting longer for the initialisation.

Thanks


jakubhava avatar jakubhava commented on May 22, 2024

@JeremyLG please have a look at https://github.com/h2oai/sparkling-water/blob/master/doc/backends.md. In some environments this is a known issue, which can be eliminated by the external backend solution.

In the original (internal) backend, we have to face the problem of dynamic allocation and YARN preemption. In case a new executor joins the Spark cluster or an existing one disconnects, we are not able to handle this cluster change in H2O and have to stop the cluster.

