h2oai / sparkling-water

Sparkling Water provides H2O functionality inside a Spark cluster

Home Page: https://docs.h2o.ai/sparkling-water/3.3/latest-stable/doc/index.html

License: Apache License 2.0

Shell 1.13% Scala 66.71% Java 1.33% Batchfile 0.33% Python 16.69% CSS 0.17% Groovy 4.56% R 2.38% HCL 1.67% TeX 5.03%
h2o spark machine-learning integration pysparkling rsparkling big-data pyspark scala

sparkling-water's Introduction


Sparkling Water

Sparkling Water integrates H2O-3, a fast, scalable machine learning engine, with Apache Spark. It provides:

  • Utilities to publish Spark data structures (RDDs, DataFrames, Datasets) as H2O-3 frames and vice versa.
  • A DSL for using Spark data structures as input to H2O-3 algorithms.
  • Basic building blocks for creating ML applications that utilize Spark and H2O APIs.
  • A Python interface enabling use of Sparkling Water directly from PySpark.

Getting Started

User Documentation

Read the documentation for Spark 3.5 (or 3.4, 3.3, 3.2, 3.1, 3.0, 2.4, 2.3)

Download Binaries

Download the latest version for Spark 3.5 (or 3.4, 3.3, 3.2, 3.1, 3.0, 2.4, 2.3)

Each Sparkling Water release is also published to Maven Central (more details below).


Try Sparkling Water!

Sparkling Water is distributed as a Spark application library that can be used by any Spark application. We also provide a zip distribution that bundles the library and shell scripts.

There are several ways of using Sparkling Water:

  • Sparkling Shell (Spark Shell with Sparkling Water included)
  • Sparkling Water driver (Spark Submit with Sparkling Water included)
  • Spark Shell with the Sparkling Water library included via the --jars or --packages option
  • Spark Submit with the Sparkling Water library included via the --jars or --packages option
  • PySpark with PySparkling

Run Sparkling Shell

The Sparkling Shell wraps a regular Spark shell and appends the Sparkling Water library to the classpath via the --jars option. The Sparkling Shell supports creation of an H2O-3 cloud and execution of H2O-3 algorithms.

  1. Either download or build Sparkling Water
  2. Configure the location of the Spark cluster:

    export SPARK_HOME="/path/to/spark/installation"
    export MASTER="local[*]"

    In this case, local[*] points to an embedded single-node cluster.

  3. Run Sparkling Shell:

    bin/sparkling-shell

    Sparkling Shell accepts common Spark Shell arguments. For example, to increase the memory allocated to each executor, use the spark.executor.memory parameter: bin/sparkling-shell --conf "spark.executor.memory=4g"

  4. Initialize H2OContext

    import ai.h2o.sparkling._
    val hc = H2OContext.getOrCreate()

    H2OContext starts H2O services on top of the Spark cluster and provides primitives for transformations between H2O-3 and Spark data structures.
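
    For example, once the H2OContext is up, a Spark DataFrame can be published as an H2O frame and materialized back. A minimal sketch, assuming the asH2OFrame/asSparkFrame conversion methods exposed by H2OContext in recent Sparkling Water releases:

    import ai.h2o.sparkling._
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    val hc = H2OContext.getOrCreate()

    // A tiny illustrative DataFrame (the column name is made up for this sketch)
    val df = spark.range(100).toDF("id")

    // Publish the Spark DataFrame as an H2O frame ...
    val h2oFrame = hc.asH2OFrame(df)

    // ... and convert it back to a Spark DataFrame
    val df2 = hc.asSparkFrame(h2oFrame)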

Use Sparkling Water with PySpark

Sparkling Water can also be used directly from PySpark; the integration is called PySparkling.

See PySparkling README to learn about PySparkling.

Use Sparkling Water via Spark Packages

To see how Sparkling Water can be used as a Spark package, please see Use as Spark Package.

Use Sparkling Water in Windows environments

See Windows Tutorial to learn how to use Sparkling Water in Windows environments.

Sparkling Water examples

To see how to run examples for Sparkling Water, please see Running Examples.

Maven packages

Each Sparkling Water release is published to Maven Central with the following coordinates:

  • ai.h2o:sparkling-water-core_{{scala_version}}:{{version}} - The core of Sparkling Water
  • ai.h2o:sparkling-water-examples_{{scala_version}}:{{version}} - Example applications
  • ai.h2o:sparkling-water-repl_{{scala_version}}:{{version}} - Spark REPL integration into the H2O Flow UI
  • ai.h2o:sparkling-water-ml_{{scala_version}}:{{version}} - Extends the Spark ML package with H2O-based transformations
  • ai.h2o:sparkling-water-scoring_{{scala_version}}:{{version}} - A library containing scoring logic and the definition of Sparkling Water MOJO models
  • ai.h2o:sparkling-water-scoring-package_{{scala_version}}:{{version}} - A lightweight Sparkling Water package including only the dependencies required for scoring with H2O-3 and DAI MOJO models
  • ai.h2o:sparkling-water-package_{{scala_version}}:{{version}} - A Sparkling Water package containing all dependencies required for model training and scoring. This is designed to be used as a Spark package via the --packages option.

    Note: {{version}} refers to a release version of Sparkling Water; {{scala_version}} refers to the Scala base version.

The full list of published packages is available here.
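
As a hedged illustration, one of these coordinates could be consumed from an sbt build as follows; the <version> string is a placeholder, not a specific release:

    // build.sbt -- a sketch only; replace <version> with an actual Sparkling Water release.
    // The %% operator appends the Scala base version ({{scala_version}}) to the artifact name.
    libraryDependencies += "ai.h2o" %% "sparkling-water-core" % "<version>"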


Sparkling Water Backends

Sparkling Water supports two backend/deployment modes: internal and external. Sparkling Water applications are independent of the selected backend. The backend must be specified before the H2OContext is created.

For more details regarding the internal or external backend, please see Backends.
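
A minimal sketch of selecting a backend programmatically, assuming the setExternalClusterMode/setInternalClusterMode setters exposed by H2OConf (the same choice can be made via the spark.ext.h2o.backend.cluster.mode property):

    import ai.h2o.sparkling._

    // Select the backend before the H2OContext is created; internal mode is the default.
    val conf = new H2OConf().setExternalClusterMode()
    val hc = H2OContext.getOrCreate(conf)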


FAQ

A list of all frequently asked questions is available at FAQ.


Development

Complete development documentation is available at Development Documentation.

Build Sparkling Water

To see how to build Sparkling Water, please see Build Sparkling Water.

Develop applications with Sparkling Water

An application using Sparkling Water is a regular Spark application that bundles the Sparkling Water library. See the Sparkling Water Droplet for an example application.
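
As a rough sketch (the object name is illustrative, and this is not the Droplet code itself), such an application is an ordinary Spark main class that creates an H2OContext:

    import org.apache.spark.sql.SparkSession
    import ai.h2o.sparkling._

    // An illustrative skeleton of a Spark application bundling Sparkling Water.
    object SparklingWaterApp {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("SparklingWaterApp").getOrCreate()
        val hc = H2OContext.getOrCreate()

        // ... convert data and run H2O-3 algorithms here ...

        hc.stop(stopSparkContext = true)
      }
    }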

Contributing

Just drop us a PR! For inspiration, look at our list of issues, or feel free to create one.

Filing Bug Reports and Feature Requests

You can file a bug report or feature request directly in GitHub Issues.

Have Questions?

We also respond to questions tagged with sparkling-water and h2o on Stack Overflow.

Change Logs

Change logs are available at Change Logs.


sparkling-water's People

Contributors

arnocandel, bghill, boroborome, chathurindaranasinghe, dzlab, ericeckstrand, h2o-ops, jakubhava, jangorecki, jessica0xdata, kanech, krasinski, ledell, mdymczyk, meganjkurka, miuma2, mklechan, mmalohlava, mn-mikke, navdeep-g, neema-m, nidhimehta, nikhilshekhar, ningbogao, petro-rudenko, satyakonidala, tomkraljevic, vpatryshev, zainhaq-h2o, zhiruiwang


sparkling-water's Issues

NullPointerException when loading a big CSV file from HDFS

My sparkling-water is 1.6.3; my Spark is CDH 5.7 - Spark 1.6.
I have a big CSV file with 1440790 rows and more than 400 columns.
I import it from HDFS, and during parseFiles I get the error below.

DistributedException from xxxx:54321, caused by java.lang.NullPointerException.

and the detailed log from the Spark AM is below

java.lang.NullPointerException
at water.fvec.FileVec.chunkOffset(FileVec.java:84)
at water.persist.PersistHdfs.load(PersistHdfs.java:151)
at water.persist.PersistManager.load(PersistManager.java:143)
at water.Value.loadPersist(Value.java:237)
at water.Value.memOrLoad(Value.java:119)
at water.Value.write_impl(Value.java:346)
at water.Value$Icer.write67(Value$Icer.java)
at water.Value$Icer.write(Value$Icer.java)
at water.Iced.write(Iced.java:61)
at water.AutoBuffer.put(AutoBuffer.java:730)
at water.TaskPutKey$Icer.write66(TaskPutKey$Icer.java)
at water.TaskPutKey$Icer.write(TaskPutKey$Icer.java)
at water.H2O$H2OCountedCompleter.write(H2O.java:1220)
at water.AutoBuffer.put(AutoBuffer.java:730)
at water.RPC.call(RPC.java:201)
at water.RPC.call(RPC.java:102)
at water.TaskPutKey.put(TaskPutKey.java:15)
at water.DKV.DputIfMatch(DKV.java:149)
at water.DKV.DputIfMatch(DKV.java:94)
at water.fvec.FileVec.chunkIdx(FileVec.java:108)
at water.fvec.Vec.chunkForChunkIdx(Vec.java:891)
at water.fvec.ByteVec.chunkForChunkIdx(ByteVec.java:20)
at water.fvec.ByteVec.chunkForChunkIdx(ByteVec.java:16)
at water.parser.FVecParseReader.getChunkData(FVecParseReader.java:25)
at water.parser.CsvParser.parseChunk(CsvParser.java:427)
at water.parser.ParseDataset$MultiFileParseTask$DistributedParse.map(ParseDataset.java:931)
at water.MRTask.compute2(MRTask.java:619)
at water.H2O$H2OCountedCompleter.compute1(H2O.java:1184)
at water.parser.ParseDataset$MultiFileParseTask$DistributedParse$Icer.compute1(ParseDataset$MultiFileParseTask$DistributedParse$Icer.java)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1180)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:914)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:979)
at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
java.lang.NullPointerException
at water.fvec.FileVec.chunkOffset(FileVec.java:84)
at water.persist.PersistHdfs.load(PersistHdfs.java:151)
at water.persist.PersistManager.load(PersistManager.java:143)
at water.Value.loadPersist(Value.java:237)
at water.Value.memOrLoad(Value.java:119)
at water.Value.write_impl(Value.java:346)
at water.Value$Icer.write67(Value$Icer.java)
at water.Value$Icer.write(Value$Icer.java)
at water.Iced.write(Iced.java:61)
at water.AutoBuffer.put(AutoBuffer.java:730)
at water.TaskPutKey$Icer.write66(TaskPutKey$Icer.java)
at water.TaskPutKey$Icer.write(TaskPutKey$Icer.java)
at water.H2O$H2OCountedCompleter.write(H2O.java:1220)
at water.AutoBuffer.put(AutoBuffer.java:730)
at water.RPC.call(RPC.java:201)
at water.RPC.call(RPC.java:102)
at water.TaskPutKey.put(TaskPutKey.java:15)
at water.DKV.DputIfMatch(DKV.java:149)
at water.DKV.DputIfMatch(DKV.java:94)
at water.fvec.FileVec.chunkIdx(FileVec.java:108)
at water.fvec.Vec.chunkForChunkIdx(Vec.java:891)
at water.fvec.ByteVec.chunkForChunkIdx(ByteVec.java:20)
at water.fvec.ByteVec.chunkForChunkIdx(ByteVec.java:16)
at water.parser.FVecParseReader.getChunkData(FVecParseReader.java:25)
at water.parser.CsvParser.parseChunk(CsvParser.java:427)
at water.parser.ParseDataset$MultiFileParseTask$DistributedParse.map(ParseDataset.java:931)
at water.MRTask.compute2(MRTask.java:619)
at water.MRTask.compute2(MRTask.java:577)
at water.MRTask.compute2(MRTask.java:577)
at water.MRTask.compute2(MRTask.java:577)
at water.H2O$H2OCountedCompleter.compute1(H2O.java:1184)
at water.parser.ParseDataset$MultiFileParseTask$DistributedParse$Icer.compute1(ParseDataset$MultiFileParseTask$DistributedParse$Icer.java)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1180)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

Question/Observation Launching Instances

This might have to do with my setup, but I'm unsure, which is why I'm asking here...

I have a small standalone Spark Cluster consisting of 1 Master and 3 Workers. For normal tasks everything works just fine. I have compiled sparkling-water rel 1.6 for my version of Spark.

I have included the sparkling-water jar in my spark-defaults.conf (spark.executor.extraClassPath and spark.driver.extraClassPath) across all Spark Workers and the Master

When I launch a spark-shell (remote or on master) and instantiate H2O
scala>
import org.apache.spark.h2o._
val h2oContext = H2OContext.getOrCreate(sc)
OR
val h2oContext = new H2OContext(sc).start(3)

I only get one instance of H2O, launched locally on the node from which I launched spark-shell

Is this the correct behavior? Most of the examples I have seen just launch instances locally on the same machine.

I was expecting instances to spin up on all workers: in this case, 3 instances, 1 on each of the Spark Workers.

Am I missing something here? Please advise; greatly appreciated, thanks

"IPs are not equal" error when starting H2OContext with Spark Context

Hi there,

I'm trying to run the pysparkling Chicago_Crime_Demo example in my 2-node (36 core) Spark cluster. The pysparkling executable runs fine and loads the Spark Context. When I try to start an H2O context though, I run into the error below. Any idea what is causing it?

Thanks!

$MASTER=spark://17.207.129.212:7077 pysparkling 
Python 2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015, 09:33:12) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
16/06/30 15:09:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Python version 2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015 09:33:12)
SparkContext available as sc, SQLContext available as sqlContext.
>>> sc
<pyspark.context.SparkContext object at 0x109df2350>
>>> from pysparkling import *
>>> sc
<pyspark.context.SparkContext object at 0x109df2350>
>>> hc= H2OContext(sc).start()
16/06/30 15:10:51 WARN H2OContext: Increasing 'spark.locality.wait' to value 30000
16/06/30 15:10:51 WARN H2OContext: The property 'spark.scheduler.minRegisteredResourcesRatio' is not specified!
We recommend to pass `--conf spark.scheduler.minRegisteredResourcesRatio=1`
16/06/30 15:10:52 WARN TaskSetManager: Lost task 1.0 in stage 6.0 (TID 127, fisher.local): java.lang.AssertionError: assertion failed: SpreadRDD failure - IPs are not equal: (1,169.254.183.61,-1,169.254.183.61) != (0, 169.254.3.199)
    at scala.Predef$.assert(Predef.scala:179)
    at org.apache.spark.h2o.H2OContextUtils$$anonfun$7.apply(H2OContextUtils.scala:107)
    at org.apache.spark.h2o.H2OContextUtils$$anonfun$7.apply(H2OContextUtils.scala:106)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

16/06/30 15:10:52 WARN TaskSetManager: Lost task 0.0 in stage 6.0 (TID 126, pearson.local): java.lang.AssertionError: assertion failed: SpreadRDD failure - IPs are not equal: (0,169.254.3.199,-1,169.254.3.199) != (1, 169.254.183.61)
    at scala.Predef$.assert(Predef.scala:179)
    at org.apache.spark.h2o.H2OContextUtils$$anonfun$7.apply(H2OContextUtils.scala:107)
    at org.apache.spark.h2o.H2OContextUtils$$anonfun$7.apply(H2OContextUtils.scala:106)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

16/06/30 15:10:52 ERROR TaskSetManager: Task 0 in stage 6.0 failed 4 times; aborting job
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "build/bdist.linux-x86_64/egg/pysparkling/context.py", line 72, in __init__
  File "build/bdist.linux-x86_64/egg/pysparkling/context.py", line 96, in _do_init
  File "/Users/hdadmin/spark-1.6.0/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/Users/hdadmin/spark-1.6.0/python/pyspark/sql/utils.py", line 45, in deco
    return f(*a, **kw)
  File "/Users/hdadmin/spark-1.6.0/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.h2o.H2OContext.getOrCreate.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6.0 (TID 132, pearson.local): java.lang.AssertionError: assertion failed: SpreadRDD failure - IPs are not equal: (0,169.254.3.199,-1,169.254.3.199) != (1, 169.254.183.61)
    at scala.Predef$.assert(Predef.scala:179)
    at org.apache.spark.h2o.H2OContextUtils$$anonfun$7.apply(H2OContextUtils.scala:107)
    at org.apache.spark.h2o.H2OContextUtils$$anonfun$7.apply(H2OContextUtils.scala:106)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
    at org.apache.spark.h2o.H2OContextUtils$.startH2O(H2OContextUtils.scala:169)
    at org.apache.spark.h2o.H2OContext.start(H2OContext.scala:227)
    at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:345)
    at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:375)
    at org.apache.spark.h2o.H2OContext.getOrCreate(H2OContext.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.AssertionError: assertion failed: SpreadRDD failure - IPs are not equal: (0,169.254.3.199,-1,169.254.3.199) != (1, 169.254.183.61)
    at scala.Predef$.assert(Predef.scala:179)
    at org.apache.spark.h2o.H2OContextUtils$$anonfun$7.apply(H2OContextUtils.scala:107)
    at org.apache.spark.h2o.H2OContextUtils$$anonfun$7.apply(H2OContextUtils.scala:106)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    ... 1 more

>>> 16/06/30 15:10:53 WARN TaskSetManager: Lost task 1.2 in stage 6.0 (TID 129, pearson.local): TaskKilled (killed intentionally)

Can the same architecture as Sparkling Water be followed to integrate H2O with Flink?

Hi Michal

I was working on integrating H2O with Flink, but I observe that the Flink roadmap follows a Mahout DSL between Flink and Mahout, along the same lines as the integration with Spark, rather than the way it is done with H2O.

Refer to this link below
http://mail-archives.apache.org/mod_mbox/flink-dev/201501.mbox/%3CCANC1h_s=DtNjS+KQcU-Uxdb=i+_o4KPV-EOKQacr-KpPFX_OKw@mail.gmail.com%3E

Kindly advise: are there any limitations or particular requirements for integrating Flink with H2O?

I believe Flink would benefit from the facts that:

  • H2O provides deep learning out of the box
  • H2O can use R data frames

Please correct me if I am wrong.

Raghav

Spark 1.3.0 is unsupported

Hi,

From the quick look I had, in core/src/main/scala/org/apache/spark/h2o/H2OContext.scala, ExistingRdd and SparkLogicalPlan have now been removed and replaced by LogicalRDD. Also, SchemaRDD has been replaced by DataFrame.

Thanks!

Cannot execute H2O on all Spark executors

Hi, I am getting a "java.lang.IllegalArgumentException: Cannot execute H2O on all Spark executors:
numH2OWorkers = -1"
executorStatus = (2,false),(1,false),(2,true),(1,true),(2,true),(1,true),(1,true),(2,true),(1,true),(2,true),(2,true),(1,true),(1,true),(2,true),(1,true),(2,true),(1,true),(2,true),(1,true),(2,true)
at org.apache.spark.h2o.H2OContext.start(H2OContext.scala:118)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:15)
at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:20)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:22)
at $iwC$$iwC$$iwC.<init>(<console>:24)
at $iwC$$iwC.<init>(<console>:26)
at $iwC.<init>(<console>:28)
at <init>(<console>:30)
at .<init>(<console>:34)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:828)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:873)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:785)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:628)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:636)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:641)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:968)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:916)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1011)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I have set the cluster master to yarn-client. If I vary --num-executors, then I get a number of falses at the beginning of the array equal to the number of executors set.

So... odd..

ArrayIndexOutOfBoundsException when converting Spark DataFrame to H2OFrame

After initializing H2OContext like val h2o = H2OContext.getOrCreate(sc)

when I try to convert my Spark DF to an H2OFrame

val h2oDF: H2OFrame = dataFrame

This gives me a java.lang.ArrayIndexOutOfBoundsException. I am running Spark 1.6.0 and Sparkling Water 1.6.1. What might the reason be?

Thanks!

DistributedException caused by java.lang.NullPointerException

I am getting this error when running the code below:

scala> val dataFile = "examples/smalldata/allyears2k_headers.csv.gz"
dataFile: String = examples/smalldata/allyears2k_headers.csv.gz

scala>

scala> val airlinesData = new H2OFrame(new File(dataFile))
DistributedException from , caused by java.lang.NullPointerException
Caused by: java.lang.NullPointerException
at water.parser.ParseSetup$GuessSetupTsk.map(ParseSetup.java:400)
at water.MRTask.compute2(MRTask.java:595)
at water.H2O$H2OCountedCompleter.compute1(H2O.java:1201)
at water.parser.ParseSetup$GuessSetupTsk$Icer.compute1(ParseSetup$GuessSetupTsk$Icer.java)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1197)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

Sample does not work for Mesos Cluster

I can run the sample using spark-submit against --master=local[2]. However, when I target my Mesos cluster, I get an NPE: by class water.parser.ParseSetup$GuessSetupTsk; class java.lang.NullPointerException: null
at water.parser.ParseSetup$GuessSetupTsk.map(ParseSetup.java:269)
at water.MRTask.compute2(MRTask.java:624)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1017)

If the issue is related to how the sample loads the data, I wonder if we should create a sample that works with remote clusters, e.g. one hosting the data on S3.

Thank you .

Yang.

I have this problem after running the crime example using Sparkling Water 1.6.2 with Python

File "/home/cloudera/crime.py", line 166, in <module>
    df_weather = h2oContext.as_spark_frame(f_weather)
File "build/bdist.linux-x86_64/egg/pysparkling/context.py", line 157, in as_spark_frame
File "build/bdist.linux-x86_64/egg/pysparkling/context.py", line 33, in get_java_h2o_frame
File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, in deco
File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o47.asH2OFrame.
: java.lang.ArrayIndexOutOfBoundsException: 65535
at water.DKV.get(DKV.java:202)
at water.DKV.get(DKV.java:175)
at water.fvec.H2OFrame.<init>(H2OFrame.scala:37)
at water.fvec.H2OFrame.<init>(H2OFrame.scala:45)
at org.apache.spark.h2o.H2OContext.asH2OFrame(H2OContext.scala:91)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)

H2O Flow cannot plot data that has more than 20 columns

Platform : MapR 5.1
Package : http://h2o-release.s3.amazonaws.com/h2o/rel-tutte/1/h2o-3.10.2.1-mapr5.1.zip

Hi. I'm sorry for writing here, but I cannot find the issue page for H2O Flow.
I ran into trouble with H2O Flow, the Web UI.

When I use the following plot function, an error occurs if I select the 21st or a later column.

plot (g) -> g(
g.point(
g.position "column1","column21"
)
g.from inspect "data", getFrame frame
)

Error evaluating cell
Error rendering vis.
Vector [column21] does not exist in frame [data]

I tried various combinations and found that the 1st to 20th columns could be plotted, but the 21st and later columns could not.
Is there any solution to this problem?

Sparkling Water assembly jar includes non-shaded AWS SDK dependencies

We wish to build a Spark job assembly jar including newer dependencies (supporting improvements to AWS CloudWatch Metrics, for instance). Sparkling Water 2.0.0 (haven't checked newer versions) includes some older AWS SDK libraries and dependencies for its S3 support, which are causing conflicts. It would be great if all dependencies included in the Sparkling Water assembly were shaded, so as to remove the burden on the user of shading all their dependencies (and figuring out where the conflicts are, etc.).

Is there a way to make H2O use domain name instead of IP in Flow URL?

Hello,
Sorry I'm writing here, but Flow and H2O-3 don't seem to have their own Issues sections.

Can I make H2O print the domain name of the server instead of IP in the console log?
I mean this part of the log:


  Open H2O Flow in browser: http://<IP>:54321 (CMD + click in Mac OSX)

I mean is there an h2o internal parameter for this?
It causes me some problems when I try to use an SSL/TLS certificate (it's based on the domain name) and the user gets a message that points to the IP. Another related problem is that the address doesn't take SSL into account (it prints 'http' instead of 'https').

I know it's a minor inconvenience, but it may be confusing the users.

Can't initialize H2OContext on IPv6-only machine

On a machine without IPv4 it simply crashes, even in local-only mode (master = local[*]):

scala> val h2oContext = new org.apache.spark.h2o.H2OContext(sc).start()
15/12/09 11:53:33 WARN H2OContext: Increasing 'spark.locality.wait' to value 30000
[Exit 255]

How to use pysparkling with LDAP?

Hello,

I'm running pysparkling (Sparkling Water 1.5.16) with a command like this:
$SPARKLING_HOME/bin/pysparkling --num-executors 3 --executor-memory 20g --executor-cores 10 --driver-memory 20g --master yarn-client --conf "spark.scheduler.minRegisteredResourcesRatio=1" --conf "spark.ext.h2o.topology.change.listener.enabled=false" --conf "spark.ext.h2o.node.network.mask=155.111.184.0/24" --conf "spark.ext.h2o.fail.on.unsupported.spark.param=false" --conf "spark.ext.h2o.ldap.login=true" --conf "spark.ext.h2o.login.conf=/opt/sparkling-water/ldap.conf" --conf "spark.ext.h2o.user.name=<USER>"

and when I try to start H2OContext:

from pysparkling import *
hc= H2OContext(sc).start()

I get a bunch of access error:

javax.security.auth.login.LoginException: User not found.
...
javax.security.auth.login.LoginException: Error obtaining user info.
...
ValueError: Cannot connect to H2O server. Please check that H2O is running at http://155.111.190.82:54321/3/

I kinit before starting the tool. Can you point me to the documentation that says how to pass a username to the start() command or otherwise solve this problem?

R + sparkling-water (H2O/Spark)

All of the demos/examples in the README.md seem to do all the prediction code in Scala and only have R as an afterthought, using the residualPlotRCode function from here to visualise in R. Scala is on my "to learn" list, but in the meantime... Given H2O's close connections with R, is it possible to see/have a Sparkling Water example/demo with R as the interface? And ideally with an EC2 example too, to illustrate the benefits of distributed parallel computing? Maybe it might need to use SparkR? Or something else? I'm not sure...

Trouble creating h2o_context - R

When I run the following code

sc <- spark_connect(master = "local")
h2o_context(sc)

I get the following error:

Error: failed to invoke spark command
16/11/01 21:39:05 ERROR getOrCreate on org.apache.spark.h2o.H2OContext failed

I have followed the instructions on http://spark.rstudio.com/h2o.html with no success. Thoughts?

What's the difference between "new H2OContext(sc).start" and "H2OContext.getOrCreate(sc)" ?

The DEVEL.md was modified in which new H2OContext(sc).start was changed to H2OContext.getOrCreate(sc) .
In sparkling-shell, it works fine when I use new H2OContext(sc).start, but I get 0 executors when I use H2OContext.getOrCreate(sc), and the Flow URL is "http://null:0". MASTER was set to "local-cluster[3,2,1024]".
I didn't find any documentation about getOrCreate, so what's the difference, and how should I decide which one to use?
I use Spark 1.5.2 and sparkling-water 1.5.9.

Building project after cloning fails

[cloudera@quickstart sparkling-water]$ ./gradlew build -x test -x integTest --stacktrace
.
.
.
:sparkling-water-py:distPython
Traceback (most recent call last):
  File "setup.py", line 67, in <module>
    packages=find_packages(exclude=['contrib', 'docs', 'tests*']) + find_packages('build/dep'),
  File "/usr/lib/python2.6/site-packages/setuptools/__init__.py", line 46, in find_packages
    for name in os.listdir(where):
OSError: [Errno 2] No such file or directory: 'build/dep'
:sparkling-water-py:distPython FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':sparkling-water-py:distPython'.
> Process 'command 'python'' finished with non-zero exit value 1

* Try:
Run with --info or --debug option to get more log output.

* Exception is:
org.gradle.api.tasks.TaskExecutionException: Execution failed for task ':sparkling-water-py:distPython'.
    at org.gradle.api.internal.tasks.execution.ExecuteActionsTaskExecuter.executeActions(ExecuteActionsTaskExecuter.java:69)
    at org.gradle.api.internal.tasks.execution.ExecuteActionsTaskExecuter.execute(ExecuteActionsTaskExecuter.java:46)
    at org.gradle.api.internal.tasks.execution.PostExecutionAnalysisTaskExecuter.execute(PostExecutionAnalysisTaskExecuter.java:35)
    at org.gradle.api.internal.tasks.execution.SkipUpToDateTaskExecuter.execute(SkipUpToDateTaskExecuter.java:64)
    at org.gradle.api.internal.tasks.execution.ValidatingTaskExecuter.execute(ValidatingTaskExecuter.java:58)
    at org.gradle.api.internal.tasks.execution.SkipEmptySourceFilesTaskExecuter.execute(SkipEmptySourceFilesTaskExecuter.java:52)
    at org.gradle.api.internal.tasks.execution.SkipTaskWithNoActionsExecuter.execute(SkipTaskWithNoActionsExecuter.java:52)
    at org.gradle.api.internal.tasks.execution.SkipOnlyIfTaskExecuter.execute(SkipOnlyIfTaskExecuter.java:53)
    at org.gradle.api.internal.tasks.execution.ExecuteAtMostOnceTaskExecuter.execute(ExecuteAtMostOnceTaskExecuter.java:43)
    at org.gradle.execution.taskgraph.DefaultTaskGraphExecuter$EventFiringTaskWorker.execute(DefaultTaskGraphExecuter.java:203)
    at org.gradle.execution.taskgraph.DefaultTaskGraphExecuter$EventFiringTaskWorker.execute(DefaultTaskGraphExecuter.java:185)
    at org.gradle.execution.taskgraph.AbstractTaskPlanExecutor$TaskExecutorWorker.processTask(AbstractTaskPlanExecutor.java:66)
    at org.gradle.execution.taskgraph.AbstractTaskPlanExecutor$TaskExecutorWorker.run(AbstractTaskPlanExecutor.java:50)
    at org.gradle.execution.taskgraph.DefaultTaskPlanExecutor.process(DefaultTaskPlanExecutor.java:25)
    at org.gradle.execution.taskgraph.DefaultTaskGraphExecuter.execute(DefaultTaskGraphExecuter.java:110)
    at org.gradle.execution.SelectedTaskExecutionAction.execute(SelectedTaskExecutionAction.java:37)
    at org.gradle.execution.DefaultBuildExecuter.execute(DefaultBuildExecuter.java:37)
    at org.gradle.execution.DefaultBuildExecuter.access$000(DefaultBuildExecuter.java:23)
    at org.gradle.execution.DefaultBuildExecuter$1.proceed(DefaultBuildExecuter.java:43)
    at org.gradle.execution.DryRunBuildExecutionAction.execute(DryRunBuildExecutionAction.java:32)
    at org.gradle.execution.DefaultBuildExecuter.execute(DefaultBuildExecuter.java:37)
    at org.gradle.execution.DefaultBuildExecuter.execute(DefaultBuildExecuter.java:30)
    at org.gradle.initialization.DefaultGradleLauncher$4.run(DefaultGradleLauncher.java:154)
    at org.gradle.internal.Factories$1.create(Factories.java:22)
    at org.gradle.internal.progress.DefaultBuildOperationExecutor.run(DefaultBuildOperationExecutor.java:90)
    at org.gradle.internal.progress.DefaultBuildOperationExecutor.run(DefaultBuildOperationExecutor.java:52)
    at org.gradle.initialization.DefaultGradleLauncher.doBuildStages(DefaultGradleLauncher.java:151)
    at org.gradle.initialization.DefaultGradleLauncher.access$200(DefaultGradleLauncher.java:32)
    at org.gradle.initialization.DefaultGradleLauncher$1.create(DefaultGradleLauncher.java:99)
    at org.gradle.initialization.DefaultGradleLauncher$1.create(DefaultGradleLauncher.java:93)
    at org.gradle.internal.progress.DefaultBuildOperationExecutor.run(DefaultBuildOperationExecutor.java:90)
    at org.gradle.internal.progress.DefaultBuildOperationExecutor.run(DefaultBuildOperationExecutor.java:62)
    at org.gradle.initialization.DefaultGradleLauncher.doBuild(DefaultGradleLauncher.java:93)
    at org.gradle.initialization.DefaultGradleLauncher.run(DefaultGradleLauncher.java:82)
    at org.gradle.launcher.exec.InProcessBuildActionExecuter$DefaultBuildController.run(InProcessBuildActionExecuter.java:94)
    at org.gradle.tooling.internal.provider.ExecuteBuildActionRunner.run(ExecuteBuildActionRunner.java:28)
    at org.gradle.launcher.exec.ChainingBuildActionRunner.run(ChainingBuildActionRunner.java:35)
    at org.gradle.launcher.exec.InProcessBuildActionExecuter.execute(InProcessBuildActionExecuter.java:43)
    at org.gradle.launcher.exec.InProcessBuildActionExecuter.execute(InProcessBuildActionExecuter.java:28)
    at org.gradle.launcher.exec.ContinuousBuildActionExecuter.execute(ContinuousBuildActionExecuter.java:77)
    at org.gradle.launcher.exec.ContinuousBuildActionExecuter.execute(ContinuousBuildActionExecuter.java:47)
    at org.gradle.launcher.daemon.server.exec.ExecuteBuild.doBuild(ExecuteBuild.java:52)
    at org.gradle.launcher.daemon.server.exec.BuildCommandOnly.execute(BuildCommandOnly.java:36)
    at org.gradle.launcher.daemon.server.api.DaemonCommandExecution.proceed(DaemonCommandExecution.java:120)
    at org.gradle.launcher.daemon.server.exec.WatchForDisconnection.execute(WatchForDisconnection.java:37)
    at org.gradle.launcher.daemon.server.api.DaemonCommandExecution.proceed(DaemonCommandExecution.java:120)
    at org.gradle.launcher.daemon.server.exec.ResetDeprecationLogger.execute(ResetDeprecationLogger.java:26)
    at org.gradle.launcher.daemon.server.api.DaemonCommandExecution.proceed(DaemonCommandExecution.java:120)
    at org.gradle.launcher.daemon.server.exec.RequestStopIfSingleUsedDaemon.execute(RequestStopIfSingleUsedDaemon.java:34)
    at org.gradle.launcher.daemon.server.api.DaemonCommandExecution.proceed(DaemonCommandExecution.java:120)
    at org.gradle.launcher.daemon.server.exec.ForwardClientInput$2.call(ForwardClientInput.java:74)
    at org.gradle.launcher.daemon.server.exec.ForwardClientInput$2.call(ForwardClientInput.java:72)
    at org.gradle.util.Swapper.swap(Swapper.java:38)
    at org.gradle.launcher.daemon.server.exec.ForwardClientInput.execute(ForwardClientInput.java:72)
    at org.gradle.launcher.daemon.server.api.DaemonCommandExecution.proceed(DaemonCommandExecution.java:120)
    at org.gradle.launcher.daemon.server.health.DaemonHealthTracker.execute(DaemonHealthTracker.java:40)
    at org.gradle.launcher.daemon.server.api.DaemonCommandExecution.proceed(DaemonCommandExecution.java:120)
    at org.gradle.launcher.daemon.server.exec.LogToClient.doBuild(LogToClient.java:66)
    at org.gradle.launcher.daemon.server.exec.BuildCommandOnly.execute(BuildCommandOnly.java:36)
    at org.gradle.launcher.daemon.server.api.DaemonCommandExecution.proceed(DaemonCommandExecution.java:120)
    at org.gradle.launcher.daemon.server.exec.EstablishBuildEnvironment.doBuild(EstablishBuildEnvironment.java:72)
    at org.gradle.launcher.daemon.server.exec.BuildCommandOnly.execute(BuildCommandOnly.java:36)
    at org.gradle.launcher.daemon.server.api.DaemonCommandExecution.proceed(DaemonCommandExecution.java:120)
    at org.gradle.launcher.daemon.server.health.HintGCAfterBuild.execute(HintGCAfterBuild.java:41)
    at org.gradle.launcher.daemon.server.api.DaemonCommandExecution.proceed(DaemonCommandExecution.java:120)
    at org.gradle.launcher.daemon.server.exec.StartBuildOrRespondWithBusy$1.run(StartBuildOrRespondWithBusy.java:50)
    at org.gradle.launcher.daemon.server.DaemonStateCoordinator$1.run(DaemonStateCoordinator.java:246)
    at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:54)
    at org.gradle.internal.concurrent.StoppableExecutorImpl$1.run(StoppableExecutorImpl.java:40)
Caused by: org.gradle.process.internal.ExecException: Process 'command 'python'' finished with non-zero exit value 1
    at org.gradle.process.internal.DefaultExecHandle$ExecResultImpl.assertNormalExitValue(DefaultExecHandle.java:367)
    at org.gradle.process.internal.DefaultExecAction.execute(DefaultExecAction.java:31)
    at org.gradle.api.tasks.AbstractExecTask.exec(AbstractExecTask.java:54)
    at org.gradle.internal.reflect.JavaMethod.invoke(JavaMethod.java:75)
    at org.gradle.api.internal.project.taskfactory.AnnotationProcessingTaskFactory$StandardTaskAction.doExecute(AnnotationProcessingTaskFactory.java:227)
    at org.gradle.api.internal.project.taskfactory.AnnotationProcessingTaskFactory$StandardTaskAction.execute(AnnotationProcessingTaskFactory.java:220)
    at org.gradle.api.internal.project.taskfactory.AnnotationProcessingTaskFactory$StandardTaskAction.execute(AnnotationProcessingTaskFactory.java:209)
    at org.gradle.api.internal.AbstractTask$TaskActionWrapper.execute(AbstractTask.java:585)
    at org.gradle.api.internal.AbstractTask$TaskActionWrapper.execute(AbstractTask.java:568)
    at org.gradle.api.internal.tasks.execution.ExecuteActionsTaskExecuter.executeAction(ExecuteActionsTaskExecuter.java:80)
    at org.gradle.api.internal.tasks.execution.ExecuteActionsTaskExecuter.executeActions(ExecuteActionsTaskExecuter.java:61)
    ... 68 more


BUILD FAILED

Total time: 38.375 secs

It seems a similar error was raised some months ago https://travis-ci.org/h2oai/sparkling-water/builds/91055165

scoring job does not appear in H2O Flow?

Hello,

I notice that nothing appears in the "getJobs" Flow cells when I perform a scoring job...
I find this a bit weird, as it doesn't let me get a feel for the expected running time of my scoring jobs in Sparkling Water.
Is this normal?

I am working with sparkling-water-core_2.11 version 2.0.2

Thanks in advance,
Loic

Error: Sparkling Water version is not set

I've followed the installation instructions and installed sparklyr (Spark 1.6.2, local), rsparkling, and the h2o package. Running:

library(sparklyr)
library(rsparkling)
library(dplyr)
sc <- spark_connect(master = "local", version = "1.6.2")

Gives:

Error in spark_dependencies(spark_version = spark_version, scala_version = scala_version) : 
  Sparkling Water version is not set. Please choose a correct version

Loaded packages:

dplyr_0.5.0      rsparkling_0.1.0 sparklyr_0.4    

Thank you very much!

Error running rsparkling in RStudio

Hi there,

I'm trying to set up and test a local instance of Sparkling Water in RStudio (using sparklyr), and while I can initialize and run H2O, I get an error when I try to run as_h2o_frame() (see below). I'm on Ubuntu running:

R 3.2.3
H2o 3.10.2.2
sparklyr 0.5.1 (with spark 2.0.2 / hadoop 2.7 / scala 2.11.6)
rsparkling 0.1.0

Many thanks for any help you can provide!

Rafael


Error: java.lang.ClassNotFoundException: org.apache.spark.h2o.H2OContext
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at sparklyr.StreamHandler$.handleMethodCall(stream.scala:77)
at sparklyr.StreamHandler$.read(stream.scala:55)
at sparklyr.BackendHandler.channelRead0(handler.scala:49)
at sparklyr.BackendHandler.channelRead0(handler.scala:14)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)) . . .
at java.lang.Thread.run(Thread.java:745)

Updating to scala 2.11

Hi Guys
Great product for Spark. Are there any plans to upgrade to Scala 2.11, before I go ahead and split out my projects for different versions?

Thanks

rsparkling -- unable to spark_connect()

Issue: unable to connect to Spark and use Sparkling Water with rsparkling.
Here is the output:
library(sparklyr)
library(rsparkling)
library(dplyr)

sc <- spark_connect("local", version = "1.6.2")
Error in spark_dependencies(spark_version = spark_version, scala_version = scala_version) :
Sparkling Water version is not set. Please choose a correct version using options, for example: options(rsparkling.sparklingwater.version = '1.6.7')

(I followed the error message and set the rsparkling version)

options(rsparkling.sparklingwater.version = '1.6.2')
sc <- spark_connect("local", version = "1.6.2")
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
cannot open file 'C:\Users\silver16\AppData\Local\Temp\RtmpY9hwuq\file14e44ad1506f_spark.log': Permission denied

###platform information
platform: windows 8 or windows 7 (same error for either platform)
JAVA_HOME version is set:

Sys.getenv("JAVA_HOME")
[1] "C:\Program Files\Java\jdk1.8.0_102"

#################
installation followed: http://spark.rstudio.com/h2o.html
specifically, here are the installation procedures (copied from the link above):
###Install h2o

Remove previous versions of h2o R package

if ("package:h2o" %in% search()) detach("package:h2o", unload=TRUE)
if ("h2o" %in% rownames(installed.packages())) remove.packages("h2o")

Next, we download R package dependencies

pkgs <- c("methods","statmod","stats","graphics",
"RCurl","jsonlite","tools","utils")
for (pkg in pkgs) {
if (!(pkg %in% rownames(installed.packages()))) install.packages(pkg)
}

Download h2o package version 3.10.0.6

install.packages("h2o", type = "source",
repos = "http://h2o-release.s3.amazonaws.com/h2o/rel-turing/6/R")

###install spark
library(sparklyr)
spark_install(version = "1.6.2")

###Install rsparkling
library(devtools)
devtools::install_github("h2oai/sparkling-water", subdir = "/r/rsparkling")

############################
NOTE 1:
I also installed these packages during the previous steps (the installation errors complained that these packages were not found):
install.packages("rprojroot", dependencies = T)  (for the h2o package installation error)
install.packages("rappdirs", dependencies = T)

NOTE 2:
If I do not load rsparkling package, then I can connect to spark without error.

I suspect the installation of rsparkling was not successful, yet it did not give any error message when the installation completed.

Thank you all.

Sparkling-water via Spark packages doesn't work with Spark 1.6.1

Hi, I'm on Hortonworks HDP 2.4 with Scala 2.10 and Spark 1.6.1. According to the README here, I should be able to use sparkling-water via spark packages by using

spark-shell --packages ai.h2o:sparkling-water-core_2.10:1.6.8,ai.h2o:sparkling-water-examples_2.10:1.6.8

However, it seems two dependencies, com.google.guava#guava;16.0.1!guava.jar(bundle) and com.google.code.findbugs#jsr305;3.0.0!jsr305.jar, are not found. Here's what I get in the terminal:

:::: WARNINGS
        [NOT FOUND  ] com.google.guava#guava;16.0.1!guava.jar(bundle) (1ms)
==== local-m2-cache: tried

  file:/root/.m2/repository/com/google/guava/guava/16.0.1/guava-16.0.1.jar

    [NOT FOUND  ] com.google.code.findbugs#jsr305;3.0.0!jsr305.jar (1ms)

==== local-m2-cache: tried

  file:/root/.m2/repository/com/google/code/findbugs/jsr305/3.0.0/jsr305-3.0.0.jar

    ::::::::::::::::::::::::::::::::::::::::::::::

    ::              FAILED DOWNLOADS            ::

    :: ^ see resolution messages for details  ^ ::

    ::::::::::::::::::::::::::::::::::::::::::::::

    :: com.google.guava#guava;16.0.1!guava.jar(bundle)

    :: com.google.code.findbugs#jsr305;3.0.0!jsr305.jar

    ::::::::::::::::::::::::::::::::::::::::::::::

I'm very new to this, any help is appreciated.

KMeansModel.KMeansParameters does not have a member '_estimate_k'

Platform : MapR 5.1
Package : Sparkling-water-1.6.8.zip
Language : Scala 2.10

Hi. I'm using a sparkling water with scala.
I tried KMeans and met an error.

--
$spark-shell --master yarn-client --jars "sparkling-water-1.6.8/assembly/build/libs/sparkling-water-assembly-1.6.8-all.jar"
scala> import _root_.hex.kmeans.KMeansModel.KMeansParameters
scala> val param = new KMeansParameters
scala> param._estimate_k
error:value _estimate_k is not a member of hex.kmeans.KMeansModel.KMeansParameters

Any other parameters are OK, e.g. _k, _max_iterations, etc.

Maybe '_estimate_k' is defined in the code below:
https://github.com/h2oai/h2o-3/blob/b0801ec5bdadb9e1ff55f995c572e4906bb27cf5/h2o-algos/src/main/java/hex/kmeans/KMeansModel.java

How can I use '_estimate_k'?
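
For reference, here is a minimal Scala sketch of how the parameter would be used, assuming a Sparkling Water build whose bundled H2O-3 actually defines KMeansParameters._estimate_k (the error above suggests the H2O-3 version shipped with 1.6.8 predates it); trainFrame is a hypothetical, already-built water.fvec.Frame:

import _root_.hex.kmeans.KMeans
import _root_.hex.kmeans.KMeansModel.KMeansParameters

val params = new KMeansParameters()
params._train = trainFrame._key  // trainFrame: hypothetical existing water.fvec.Frame
params._estimate_k = true        // ask H2O to estimate a suitable k (assumes the field exists)
params._max_iterations = 100
val model = new KMeans(params).trainModel().get()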

ArrayIndexOutOfBoundsException when loading a larger data sample

Hello, devs!

First of all, thanks for making this very nice library open source.

While experimenting with sparkling-water, I hit the following issue:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 97 in stage 25.0 failed 4 times, most recent failure: Lost task 97.3 in stage 25.0 (TID 2553, prtlap02): java.lang.ArrayIndexOutOfBoundsException: 65535
        at water.DKV.get(DKV.java:202)
        at water.DKV.get(DKV.java:175)
        at water.Key.get(Key.java:83)
        at water.fvec.Frame.createNewChunks(Frame.java:887)
        at water.fvec.FrameUtils$class.createNewChunks(FrameUtils.scala:43)
        at water.fvec.FrameUtils$.createNewChunks(FrameUtils.scala:70)
        at org.apache.spark.h2o.backends.internal.InternalWriteConverterCtx.createChunks(InternalWriteConverterCtx.scala:29)
        at org.apache.spark.h2o.converters.SparkDataFrameConverter$.org$apache$spark$h2o$converters$SparkDataFrameConverter$$perSQLPartition(SparkDataFrameConverter.scala:94)
        at org.apache.spark.h2o.converters.SparkDataFrameConverter$$anonfun$toH2OFrame$1$$anonfun$apply$2.apply(SparkDataFrameConverter.scala:73)
        at org.apache.spark.h2o.converters.SparkDataFrameConverter$$anonfun$toH2OFrame$1$$anonfun$apply$2.apply(SparkDataFrameConverter.scala:73)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
        at org.apache.spark.scheduler.Task.run(Task.scala:86)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Please note that I am seeing this error only when trying to load a larger user sample (500K users) and it works fine with a smaller one (200K users). I am using sparkling-water v2.0.3 and the following deployment script for my application:

#!/usr/bin/env bash
# Usage (inferred from the positional parameters):
#   <script> <app-jar> <main-class> <max-cores> <executor-cores>

# Change this to point to an existing log4j.properties file
LOGFILE=file:///home/sancho/app/log4j.properties

/usr/local/spark-2.0.1-bin-hadoop2.7/bin/spark-submit \
  --driver-java-options "-Dspark.network.timeout=480s -Dspark.cores.max=$3 -Dlog4j.configuration=$LOGFILE" \
  --class $2 \
  --conf "spark.cores.max=$3" \
  --conf "spark.executor.cores=$4" \
  --conf "spark.kryoserializer.buffer.max=1024m" \
  --conf "spark.dynamicAllocation.enabled=false" \
  --executor-memory 50G \
  --driver-memory 15G \
  $1 h2o 200 5 0.1   # application jar followed by its own arguments

Prior to the ArrayIndexOutOfBoundsException, I see two warnings. I am not sure if they are related, but posting them here for reference:

2017-02-03 09:27:28,551 WARN  could not create Vfs.Dir from url. ignoring the exception and continuing <org.reflections.Reflections>
org.reflections.ReflectionsException: could not create Vfs.Dir from url, no matching UrlType was found [file:/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/libatk-wrapper.so]
either use fromURL(final URL url, final List<UrlType> urlTypes) or use the static setDefaultURLTypes(final List<UrlType> urlTypes) or addDefaultURLTypes(UrlType urlType) with your specialized UrlType.
        at org.reflections.vfs.Vfs.fromURL(Vfs.java:109)
        at org.reflections.vfs.Vfs.fromURL(Vfs.java:91)
        at org.reflections.Reflections.scan(Reflections.java:237)
        at org.reflections.Reflections.scan(Reflections.java:204)
        at org.reflections.Reflections.<init>(Reflections.java:129)
        at org.reflections.Reflections.<init>(Reflections.java:170)
        at org.reflections.Reflections.<init>(Reflections.java:143)
        at water.api.SchemaServer.registerAllSchemasIfNecessary(SchemaServer.java:197)
        at water.H2O.finalizeRegistration(H2O.java:1512)
        at org.apache.spark.h2o.backends.internal.InternalH2OBackend.init(InternalH2OBackend.scala:115)
        at org.apache.spark.h2o.H2OContext.init(H2OContext.scala:102)
        at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:279)
        at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:301)
        at tera.profile.experimental.ml.App$.main(App.scala:41)
        at tera.profile.experimental.ml.App.main(App.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
2017-02-03 09:27:28,699 WARN  could not create Dir using jarFile from url file:/mnt/ux0//hadoop-2.7.2/share/hadoop/tools/lib/hadoop-aws-2.7.1.jar. skipping. <org.reflections.Reflections>
java.lang.NullPointerException
        at java.util.zip.ZipFile.<init>(ZipFile.java:207)
        at java.util.zip.ZipFile.<init>(ZipFile.java:149)
        at java.util.jar.JarFile.<init>(JarFile.java:166)
        at java.util.jar.JarFile.<init>(JarFile.java:130)
        at org.reflections.vfs.Vfs$DefaultUrlTypes$1.createDir(Vfs.java:212)
        at org.reflections.vfs.Vfs.fromURL(Vfs.java:99)
        at org.reflections.vfs.Vfs.fromURL(Vfs.java:91)
        at org.reflections.Reflections.scan(Reflections.java:237)
        at org.reflections.Reflections.scan(Reflections.java:204)
        at org.reflections.Reflections.<init>(Reflections.java:129)
        at org.reflections.Reflections.<init>(Reflections.java:170)
        at org.reflections.Reflections.<init>(Reflections.java:143)
        at water.api.SchemaServer.registerAllSchemasIfNecessary(SchemaServer.java:197)
        at water.H2O.finalizeRegistration(H2O.java:1512)
        at org.apache.spark.h2o.backends.internal.InternalH2OBackend.init(InternalH2OBackend.scala:115)
        at org.apache.spark.h2o.H2OContext.init(H2OContext.scala:102)
        at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:279)
        at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:301)
        at tera.profile.experimental.ml.App$.main(App.scala:41)
        at tera.profile.experimental.ml.App.main(App.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I find the second warning a bit weird: it's looking for a 2.7.1 jar inside the Hadoop 2.7.2 folder. That file is not present; instead, I have hadoop-aws-2.7.2.jar.

I hope you can point me in the right direction, as I am not sure how to debug this further.

Thanks!

Windows sparkling-shell cannot find version

Hi, I'm testing this out on Windows and I'm getting an error after the cmd script fails to find a version number:

sparkling-shell --conf "spark.executor.memory=1g"
find: 'version': No such file or directory
You are trying to use Sparkling Water built for Spark , but your %SPARK_HOME(=C:/spark-2.0.0-bin-hadoop2.7) property points to Spark of version ~-5. Please ensure correct Spark is provided and re-run Sparkling Water.

AttributeError: 'H2OContext' object has no attribute 'start'

C02S81V1G8WM:Downloads me$ $SPARK_HOME/bin/spark-submit --jars $SPARKLING_HOME/assembly/build/libs/$FAT_JAR --master local[*] --driver-memory 2g --driver-java-options "$SCRIPT_H2O_SYS_OPS" --driver-class-path $SPARKLING_HOME/assembly/build/libs/$FAT_JAR --py-files $PY_EGG_FILE --conf spark.driver.extraJavaOptions="-XX:MaxPermSize=384m" ChicagoCrimeDemo.py
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=384m; support was removed in 8.0
17/02/10 12:43:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Traceback (most recent call last):
  File "/Users/me/Downloads/ChicagoCrimeDemo.py", line 93, in <module>
    h2oContext = H2OContext(sc).start()
AttributeError: 'H2OContext' object has no attribute 'start'

H2OContext(sc) fails: AttributeError: 'H2OContext' object has no attribute '_client_ip'

Hi guys,

Thanks for this great package.
I'm having a problem initializing H2OContext.

Using:
sparkling-water 1.6.7 (downloaded from your website)
spark 1.6.2 (pre-built for hadoop 2.6)
OS X 10.11.6

Thanks
Imri

imris1:sparkling-water-1.6.7 imris$ bin/pysparkling
Python 2.7.12 |Anaconda custom (x86_64)| (default, Jul  2 2016, 17:43:17)
Type "copyright", "credits" or "license" for more information.

IPython 4.2.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
16/09/19 09:10:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/09/19 09:10:39 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
16/09/19 09:10:39 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.2
      /_/

Using Python version 2.7.12 (default, Jul  2 2016 17:43:17)
SparkContext available as sc, HiveContext available as sqlContext.

In [1]: from pysparkling import *

In [2]: import h2o

In [3]: H2OContext(sc)
/Users/imris/anaconda/lib/python2.7/site-packages/IPython/core/formatters.py:92: DeprecationWarning: DisplayFormatter._ipython_display_formatter_default is deprecated: use @default decorator instead.
  def _ipython_display_formatter_default(self):
/Users/imris/anaconda/lib/python2.7/site-packages/IPython/core/formatters.py:669: DeprecationWarning: PlainTextFormatter._singleton_printers_default is deprecated: use @default decorator instead.
  def _singleton_printers_default(self):
/Users/imris/anaconda/lib/python2.7/site-packages/IPython/core/formatters.py:672: DeprecationWarning: PlainTextFormatter._type_printers_default is deprecated: use @default decorator instead.
  def _type_printers_default(self):
/Users/imris/anaconda/lib/python2.7/site-packages/IPython/core/formatters.py:677: DeprecationWarning: PlainTextFormatter._deferred_printers_default is deprecated: use @default decorator instead.
  def _deferred_printers_default(self):
Out[3]: ---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/Users/imris/anaconda/lib/python2.7/site-packages/IPython/core/formatters.pyc in __call__(self, obj)
    697                 type_pprinters=self.type_printers,
    698                 deferred_pprinters=self.deferred_printers)
--> 699             printer.pretty(obj)
    700             printer.flush()
    701             return stream.getvalue()

/Users/imris/anaconda/lib/python2.7/site-packages/IPython/lib/pretty.pyc in pretty(self, obj)
    381                             if callable(meth):
    382                                 return meth(obj, self, cycle)
--> 383             return _default_pprint(obj, self, cycle)
    384         finally:
    385             self.end_group()

/Users/imris/anaconda/lib/python2.7/site-packages/IPython/lib/pretty.pyc in _default_pprint(obj, p, cycle)
    501     if _safe_getattr(klass, '__repr__', None) not in _baseclass_reprs:
    502         # A user-provided repr. Find newlines and replace them with p.break_()
--> 503         _repr_pprint(obj, p, cycle)
    504         return
    505     p.begin_group(1, '<')

/Users/imris/anaconda/lib/python2.7/site-packages/IPython/lib/pretty.pyc in _repr_pprint(obj, p, cycle)
    692     """A pprint that just redirects to the normal repr function."""
    693     # Find newlines and replace them with p.break_()
--> 694     output = repr(obj)
    695     for idx,output_line in enumerate(output.splitlines()):
    696         if idx:

/private/var/folders/t6/f07g1ljj49v7krpwn4nn9plr0000gn/T/imris/spark/work/spark-24b44179-b14f-4f52-a152-51049beb07f4/userFiles-15d39ea2-becd-44ec-a2b7-da4d93155b4d/h2o_pysparkling_1.6-1.6.7-py2.7.egg/pysparkling/context.pyc in __repr__(self)
    142
    143     def __repr__(self):
--> 144         self.show()
    145         return ""
    146

/private/var/folders/t6/f07g1ljj49v7krpwn4nn9plr0000gn/T/imris/spark/work/spark-24b44179-b14f-4f52-a152-51049beb07f4/userFiles-15d39ea2-becd-44ec-a2b7-da4d93155b4d/h2o_pysparkling_1.6-1.6.7-py2.7.egg/pysparkling/context.pyc in show(self)
    146
    147     def show(self):
--> 148         print self
    149
    150     def get_conf(self):

/private/var/folders/t6/f07g1ljj49v7krpwn4nn9plr0000gn/T/imris/spark/work/spark-24b44179-b14f-4f52-a152-51049beb07f4/userFiles-15d39ea2-becd-44ec-a2b7-da4d93155b4d/h2o_pysparkling_1.6-1.6.7-py2.7.egg/pysparkling/context.pyc in __str__(self)
    139
    140     def __str__(self):
--> 141         return "H2OContext: ip={}, port={} (open UI at http://{}:{} )".format(self._client_ip, self._client_port, self._client_ip, self._client_port)
    142
    143     def __repr__(self):

AttributeError: 'H2OContext' object has no attribute '_client_ip'

NullPointerException in Docker run-example.sh

Getting the following exception after running bin/run-example.sh inside the container's bash session:

15/10/20 16:29:02 INFO TaskSetManager: Finished task 196.0 in stage 17.0 (TID 1040) in 565 ms on 0b4ad106ca69 (200/200)
15/10/20 16:29:02 INFO TaskSchedulerImpl: Removed TaskSet 17.0, whose tasks have all completed, from pool
15/10/20 16:29:02 INFO DAGScheduler: Stage 17 (runJob at H2OContext.scala:258) finished in 20.698 s
15/10/20 16:29:02 INFO DAGScheduler: Job 8 finished: runJob at H2OContext.scala:258, took 21.150594 s
Exception in thread "main" java.lang.NullPointerException
    at org.apache.spark.examples.h2o.AirlinesWithWeatherDemo2$.main(AirlinesWithWeatherDemo2.scala:83)
    at org.apache.spark.examples.h2o.AirlinesWithWeatherDemo2.main(AirlinesWithWeatherDemo2.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

^C15/10/20 16:32:24 INFO ExecutorRunner: Killing process!
15/10/20 16:32:24 INFO ExecutorRunner: Killing process!
15/10/20 16:32:24 INFO ExecutorRunner: Killing process!
15/10/20 16:32:24 ERROR FileAppender: Error writing stream to file /opt/spark/work/app-20151020162558-0000/1/stderr
java.io.IOException: Stream closed
    at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:162)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:272)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70)
    at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
    at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
    at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
    at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)
15/10/20 16:32:24 ERROR FileAppender: Error writing stream to file /opt/spark/work/app-20151020162558-0000/0/stderr
java.io.IOException: Stream closed
    at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:162)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:272)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70)
    at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
    at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
    at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
    at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)
15/10/20 16:32:24 ERROR FileAppender: Error writing stream to file /opt/spark/work/app-20151020162558-0000/2/stderr
java.io.IOException: Stream closed
    at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:162)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:272)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70)
    at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
    at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
    at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
    at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)
15/10/20 16:32:25 INFO Worker: Executor app-20151020162558-0000/2 finished with state EXITED message Command exited with code 130 exitStatus 130
15/10/20 16:32:25 INFO Worker: Executor app-20151020162558-0000/0 finished with state EXITED message Command exited with code 130 exitStatus 130
15/10/20 16:32:25 ERROR TaskSchedulerImpl: Lost executor 0 on 0b4ad106ca69: remote Akka client disassociated

Sparkling Water on CDH 5.9's Spark 1.6

Hi,

On Sparkling Water 1.6.8 and CDH 5.9's Spark 1.6, when creating an H2OContext, I get the following error, which I could not reproduce with other versions of CDH or Spark:

java.lang.AbstractMethodError
     at org.apache.spark.Logging$class.log(Logging.scala:50)
     at org.apache.spark.h2o.backends.internal.InternalH2OBackend.log(InternalH2OBackend.scala:31)
     at org.apache.spark.Logging$class.logWarning(Logging.scala:70)
     at org.apache.spark.h2o.backends.internal.InternalH2OBackend.logWarning(InternalH2OBackend.scala:31)
     at org.apache.spark.h2o.backends.SharedBackendUtils$class.checkAndUpdateConf(SharedBackendUtils.scala:54)
     at org.apache.spark.h2o.backends.internal.InternalH2OBackend.checkAndUpdateConf(InternalH2OBackend.scala:40)
     at org.apache.spark.h2o.H2OContext.<init>(H2OContext.scala:83)
     at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:262)
     at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:277)

This reminds me of SW-176, where CDH 5.7's Spark 1.6.0 had an additional method on some class (compared to Spark's official source code).

cannot start pysparkling

I'm trying pysparkling on CDH 5.8, using sparkling-water 1.6.8. When I ran

from pysparkling import *
hc = H2OContext.getOrCreate(sc)

I got this error:

Py4JJavaError Traceback (most recent call last)
in <module>()
1 from pysparkling import *
----> 2 hc = H2OContext.getOrCreate(sc)

/tmp/cloudera/spark/work/spark-f3fffd6e-0906-45c4-a4d3-4bb2008d6e46/userFiles-f3934596-ade2-43ff-b59a-642c416f5ae2/h2o_pysparkling_1.6-1.6.8-py2.7.egg/pysparkling/context.pyc in getOrCreate(spark_context, conf)
126 selected_conf = H2OConf(spark_context)
127 method_params[1] = selected_conf._jconf
--> 128 jhc = method.invoke(None, method_params)
129 h2o_context._jhc = jhc
130 h2o_context._conf = selected_conf

/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
811 answer = self.gateway_client.send_command(command)
812 return_value = get_return_value(
--> 813 answer, self.gateway_client, self.target_id, self.name)
814
815 for temp_arg in temp_args:

/usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
43 def deco(*a, **kw):
44 try:
---> 45 return f(*a, **kw)
46 except py4j.protocol.Py4JJavaError as e:
47 s = e.java_exception.toString()

/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
306 raise Py4JJavaError(
307 "An error occurred while calling {0}{1}{2}.\n".
--> 308 format(target_id, ".", name), value)
309 else:
310 raise Py4JError(

Py4JJavaError: An error occurred while calling o74.invoke.
: java.lang.AbstractMethodError
at org.apache.spark.Logging$class.log(Logging.scala:50)
at org.apache.spark.h2o.backends.internal.InternalH2OBackend.log(InternalH2OBackend.scala:31)
at org.apache.spark.Logging$class.logWarning(Logging.scala:70)
at org.apache.spark.h2o.backends.internal.InternalH2OBackend.logWarning(InternalH2OBackend.scala:31)
at org.apache.spark.h2o.backends.SharedBackendUtils$class.checkAndUpdateConf(SharedBackendUtils.scala:54)
at org.apache.spark.h2o.backends.internal.InternalH2OBackend.checkAndUpdateConf(InternalH2OBackend.scala:40)
at org.apache.spark.h2o.H2OContext.<init>(H2OContext.scala:83)
at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:262)
at org.apache.spark.h2o.H2OContext.getOrCreate(H2OContext.scala)
at org.apache.spark.h2o.JavaH2OContext.getOrCreate(JavaH2OContext.java:167)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)

Unable to access Hive table from pysparkling (permission denied)

Hello,
I'm running pysparkling (Sparkling Water 1.5.16) with the following command:
$SPARKLING_HOME/bin/pysparkling --num-executors 3 --executor-memory 20g --executor-cores 10 --driver-memory 20g --master yarn-client --conf "spark.scheduler.minRegisteredResourcesRatio=1" --conf "spark.ext.h2o.topology.change.listener.enabled=false" --conf "spark.ext.h2o.node.network.mask=155.111.184.0/24" --conf "spark.ext.h2o.fail.on.unsupported.spark.param=false"

In the console I run the following set of commands:

from pysparkling import *
hc= H2OContext(sc).start()
import h2o
query="select * from default.table"
combinations=sqlContext.sql(query)
h2oframe = hc.as_h2o_frame(combinations)

The last one fails with an error:

Permission denied: user=, access=READ_EXECUTE, inode="/user/hive/warehouse/table":hive:hive:drwxrwx--t

I ran kinit before starting pysparkling, and I verified that I can actually run the same query in Hive via HUE. Can you please help me debug this?

Sparkling Water on CDH 5.8 Sandbox Spark 1.6

I'm running Sparkling Water 1.6.8 on Spark 1.6, and I got the following error.

Sparkling Water Context:
 * H2O name: sparkling-water-cloudera_-1596704912
 * cluster size: 1
 * list of used nodes:
  (executorId, host, port)
  ------------------------
  (driver,localhost,54321)
  ------------------------

  Open H2O Flow in browser: http://127.0.0.1:54321 (CMD + click in Mac OSX)
    
Exception in thread "main" DistributedException from localhost/127.0.0.1:54321, caused by water.parser.ParseDataset$H2OParseException: Cannot determine file type. for hdfs://127.0.0.1/user/cloudera/allyears2k_headers.csv.gz
	at water.MRTask.getResult(MRTask.java:477)
	at water.MRTask.getResult(MRTask.java:485)
	at water.MRTask.doAll(MRTask.java:401)
	at water.parser.ParseSetup.guessSetup(ParseSetup.java:263)
	at water.parser.ParseSetup.guessSetup(ParseSetup.java:246)
	at water.parser.ParseDataset.parse(ParseDataset.java:38)
	at water.parser.ParseDataset.parse(ParseDataset.java:32)
	at water.util.FrameUtils.parseFrame(FrameUtils.java:58)
	at water.util.FrameUtils.parseFrame(FrameUtils.java:47)
	at water.fvec.H2OFrame.<init>(H2OFrame.scala:66)
	at H2ODeepLearning$.main(H2ODeepLearning.scala:36)
	at H2ODeepLearning.main(H2ODeepLearning.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: water.parser.ParseDataset$H2OParseException: Cannot determine file type. for hdfs://127.0.0.1/user/cloudera/allyears2k_headers.csv.gz
	at water.parser.ParseSetup.guessSetup(ParseSetup.java:541)
	at water.parser.ParseSetup.guessSetup(ParseSetup.java:533)
	at water.parser.ParseSetup$GuessSetupTsk.map(ParseSetup.java:350)
	at water.MRTask.compute2(MRTask.java:595)
	at water.H2O$H2OCountedCompleter.compute1(H2O.java:1201)
	at water.parser.ParseSetup$GuessSetupTsk$Icer.compute1(ParseSetup$GuessSetupTsk$Icer.java)
	at water.H2O$H2OCountedCompleter.compute(H2O.java:1197)
	at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
	at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
	at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
	at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
	at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

Because there is only one node in the Sandbox, I referenced previous issues and set the Spark configuration with the following command:

spark-submit --master local[*] --conf spark.dynamicAllocation.enabled=false --conf spark.ext.h2o.repl.enabled=false --conf spark.ext.h2o.backend.cluster.mode=internal --class H2ODeepLearning SparkApp-assembly-1.0.jar

The exception occurs when calling the H2OFrame constructor:

val conf = configure("Sparkling Water: Deep Learning on Airlines data")
val sc = new SparkContext(conf)
//addFiles(sc, absPath("allyears2k_headers.csv.gz"))

implicit val sqlContext = SQLContext.getOrCreate(sc)
import sqlContext.implicits._ // import implicit conversions

// Run H2O cluster inside Spark cluster
val h2oContext = H2OContext.getOrCreate(sc)
import h2oContext._
import h2oContext.implicits._

// Load H2O from CSV file (i.e., access directly H2O cloud)
// Use super-fast advanced H2O CSV parser !!!
// val airlinesData = new H2OFrame(new File(SparkFiles.get("allyears2k_headers.csv.gz")))
val airlinesData = new H2OFrame(new java.net.URI("hdfs://127.0.0.1/user/cloudera/allyears2k_headers.csv.gz"))
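
As a hedged aside, the commented-out lines in the snippet above point to an alternative worth trying: ship the file to the cluster with Spark and parse the local copy instead of an HDFS URI. A minimal sketch of that variant (the path is a placeholder, and sc is the SparkContext created above):

import java.io.File
import org.apache.spark.SparkFiles
import water.fvec.H2OFrame

// Distribute the file with Spark, then parse the local copy (mirrors the
// commented-out addFiles/SparkFiles lines above).
sc.addFile("/path/to/allyears2k_headers.csv.gz")  // placeholder path
val airlinesData = new H2OFrame(new File(SparkFiles.get("allyears2k_headers.csv.gz")))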

And here is my build.sbt:

name := "SparkApp"

version := "1.0"

scalaVersion := "2.10.5"

libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.6.0" % "provided"

libraryDependencies += "org.apache.spark" % "spark-sql_2.10" % "1.6.0" % "provided"

libraryDependencies += "org.apache.spark" % "spark-mllib_2.10" % "1.6.0" % "provided"

libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.6.0" % "provided"

libraryDependencies += "org.apache.spark" % "spark-streaming-kafka_2.10" % "1.6.0" % "provided"

libraryDependencies += "javax.xml.bind" % "jsr173_api" % "1.0" % "provided"

libraryDependencies += "ai.h2o" % "sparkling-water-core_2.10" % "1.6.8"

libraryDependencies += "ai.h2o" % "sparkling-water-examples_2.10" % "1.6.8"

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}

error with H2OContext().start()

I'm running Sparkling Water 1.6.3 on Spark 1.6, and I got the following error.

Exception in thread "main" java.lang.NoSuchFieldException: classServer
at java.lang.Class.getDeclaredField(Class.java:1948)
at org.apache.spark.repl.h2o.H2OIMain.stopClassServer(H2OIMain.scala:74)
at org.apache.spark.repl.h2o.H2OIMain.<init>(H2OIMain.scala:41)
at org.apache.spark.repl.h2o.H2OIMain$.createInterpreter(H2OIMain.scala:180)
at org.apache.spark.repl.h2o.H2OInterpreter.createInterpreter(H2OInterpreter.scala:151)
at org.apache.spark.repl.h2o.H2OInterpreter.initializeInterpreter(H2OInterpreter.scala:105)
at org.apache.spark.repl.h2o.H2OInterpreter.<init>(H2OInterpreter.scala:328)
at water.api.scalaInt.ScalaCodeHandler.createInterpreterInPool(ScalaCodeHandler.scala:100)
at water.api.scalaInt.ScalaCodeHandler$$anonfun$initializeInterpeterPool$1.apply(ScalaCodeHandler.scala:94)
at water.api.scalaInt.ScalaCodeHandler$$anonfun$initializeInterpeterPool$1.apply(ScalaCodeHandler.scala:93)
at scala.collection.immutable.Range.foreach(Range.scala:141)
at water.api.scalaInt.ScalaCodeHandler.initializeInterpeterPool(ScalaCodeHandler.scala:93)
at water.api.scalaInt.ScalaCodeHandler.<init>(ScalaCodeHandler.scala:37)
at org.apache.spark.h2o.H2OContext$.registerScalaIntEndp(H2OContext.scala:830)
at org.apache.spark.h2o.H2OContext$.registerClientWebAPI(H2OContext.scala:750)
at org.apache.spark.h2o.H2OContext.start(H2OContext.scala:225)
at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:337)
at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:363)
at org.apache.spark.examples.h2o.AirlinesWithWeatherDemo2$.main(AirlinesWithWeatherDemo2.scala:44)
at org.apache.spark.examples.h2o.AirlinesWithWeatherDemo2.main(AirlinesWithWeatherDemo2.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

[ERROR] Executor without H2O instance discovered, killing the cloud!

I'm getting the error mentioned in the title. No clue why.

The command I use to run Sparkling Water is: spark-submit --class water.SparklingWaterDriver --master yarn-client --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 /opt/sparkling-water/sparkling-water-1.5.14/assembly/build/libs/*.jar

The full error stack trace looks like this:

16/05/16 09:24:15 ERROR LiveListenerBus: Listener anon1 threw an exception
java.lang.IllegalArgumentException: Executor without H2O instance discovered, killing the cloud!
at org.apache.spark.h2o.H2OContext$$anon$1.onExecutorAdded(H2OContext.scala:180)
at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:58)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:56)
at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:79)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1136)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
16/05/16 09:24:16 INFO BlockManagerMasterEndpoint: Registering block manager bda1node05.na.pg.com:17644 with 1060.0 MB RAM, BlockManagerId(4, bda1node05.na.pg.com, 17644)
Exception in thread "main" java.lang.RuntimeException: Cloud size under 3
at water.H2O.waitForCloudSize(H2O.java:1547)
at org.apache.spark.h2o.H2OContext.start(H2OContext.scala:223)
at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:337)
at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:363)
at water.SparklingWaterDriver$.main(SparklingWaterDriver.scala:38)
at water.SparklingWaterDriver.main(SparklingWaterDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
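
Not a confirmed fix for this report, but other reports in this document pair the internal backend with settings that keep the executor set static, so that no executor without an H2O instance can join after the cloud forms. A minimal Scala sketch of those settings applied programmatically:

import org.apache.spark.SparkConf

// Sketch: keep the executor set static so no H2O-less executor joins later
// (settings borrowed from other reports in this document; not a confirmed fix).
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "false")
  .set("spark.scheduler.minRegisteredResourcesRatio", "1")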

Error compiling "./gradlew build"

When I attempt to compile on an x86_64 system, the build fails with:

H2OContextUtils.scala:52: value actorSystem in class SparkEnv is deprecated: Actor system is no longer supported as of 1.4.0
env.actorSystem.settings.config.getString("akka.remote.netty.tcp.hostname")
^
there were 1 feature warning(s); re-run with -feature for details
three warnings found
:sparkling-water-core:compileScala FAILED

FAILURE: Build failed with an exception.

What can I do to address this?

I am using rel-1.6.
