The amunategui.github.io from amunategui

Trouble loading huge data sets on EC2 Spark Clusters

Hi Sir,

I followed your guide, and it was a great help!
I launched one master and two slaves with 8 GB of memory on AWS EC2.
I want to load 220 MB CSV files from an S3 bucket and convert them to a Spark DataFrame, but I get a Java heap space error. The code and the error follow.

CODE:
library("SparkR", lib.loc = "/root/spark/R/lib")
library(data.table)
Sys.setenv(SPARK_HOME = "/root/spark")
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)
data_load <- read.csv("https://s3-ap-southeast-1.amazonaws.com/foldername/filename.csv")
dfr <- createDataFrame(sqlContext, data_load)

ERROR:

java.lang.OutOfMemoryError: Java heap space
at org.apache.spark.api.r.SerDe$.readBytes(SerDe.scala:95)
at org.apache.spark.api.r.SerDe$$anonfun$readBytesArr$1.apply(SerDe.scala:140)
at org.apache.spark.api.r.SerDe$$anonfun$readBytesArr$1.apply(SerDe.scala:140)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.Range.foreach(Range.scala:141)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.api.r.SerDe$.readBytesArr(SerDe.scala:140)
at org.apache.spark.api.r.SerDe$.readArray(SerDe.scala:172)
at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:74)
at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:60)
at org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:182)
at org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:181)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.Range.foreach(Range.scala:141)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.api.r.RBackendHandler.readArgs(RBackendHandler.scala:181)
at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:123)
at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:86)
at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:38)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
Error in if (returnStatus != 0) { : argument is of length zero

Please advise on how to resolve this. Thanks in advance!
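
A possible explanation, judging from the SerDe frames in the trace: createDataFrame() sends the whole local data.frame from R through the R backend into the driver JVM, so the driver heap has to hold all of it at once. Below is a minimal sketch of two workarounds, assuming SparkR 1.5 or 1.6: raise the driver memory, and let Spark read the CSV itself through the spark-csv package instead of read.csv(). The "4g" value, the spark-csv version, the s3n:// form of the path, and the presence of AWS credentials in the Hadoop configuration are all assumptions, not details taken from the report.

```r
library("SparkR", lib.loc = "/root/spark/R/lib")
Sys.setenv(SPARK_HOME = "/root/spark")

# Assumption: 4g fits on the master node; the spark-csv artifact must match
# the Scala version the cluster's Spark build uses.
sc <- sparkR.init(
  sparkEnvir = list(spark.driver.memory = "4g"),
  sparkPackages = "com.databricks:spark-csv_2.10:1.3.0"
)
sqlContext <- sparkRSQL.init(sc)

# Assumption: the bucket is reachable through the s3n:// scheme and AWS
# credentials are already configured for Hadoop on the cluster.
dfr <- read.df(
  sqlContext,
  "s3n://foldername/filename.csv",
  source = "com.databricks.spark.csv",
  header = "true",
  inferSchema = "true"
)
```

With this approach the file is parsed by Spark itself, so the data never has to pass through the R process as a local data.frame.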

fork as template

Hi,

I am not sure if this is the correct platform, but is it alright for me to fork this repo as a template for my own blog/profile?

Regards
Germayne

caret + tbl_df incompatibility

Thank you for your helpful introduction to modeling binary outcomes using caret. I reproduced your example successfully, but when I followed the same steps with my own data, I ran into a frustrating issue: it failed at the GBM train() line and dropped into an unhelpful debug mode. It turns out that the issue stems from an incompatibility between caret and the increasingly popular tbl_df (the tidyverse 'tibble'). The issue is documented in topepo/caret#145, topepo/caret#611, and http://stackoverflow.com/questions/29802216/caret-error-using-gbm-but-not-without-caret. The cause is not intuitive, but solving the problem is easy:
df <- as.data.frame(df)
It is a peculiar coincidence that the first example in your most helpful guide leads to this error. If you revise your guide, it would be helpful to include a mention of this issue. Thanks again!
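
The conversion above can be wrapped in a small guard that runs before any call to train(). This is only a sketch: to_plain_df is a hypothetical helper name, and the tibble is simulated by setting the class vector by hand so that the example runs in base R without the tibble package installed.

```r
# Hypothetical helper: strip the tibble classes before passing data to caret.
to_plain_df <- function(x) {
  if (inherits(x, "tbl_df")) {
    x <- as.data.frame(x)
  }
  x
}

# Simulate a tibble's class vector without requiring the tibble package.
df <- structure(
  data.frame(outcome = c("yes", "no"), x1 = c(1.5, 2.5)),
  class = c("tbl_df", "tbl", "data.frame")
)

df <- to_plain_df(df)
class(df)
```

Inputs that are already plain data frames pass through unchanged, so the guard can be applied unconditionally.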

amunategui.github.io's Issues

Very helpful post on the setting up of Flask for EC2. Saved my ass.
