
Machine Learning with Apache Spark & Cassandra

Cassandra + Spark = ❤️

A hands-on lab delivered by DataStax's Developer Advocates team. Want to learn the awesomeness of distributed databases and computational systems? Jump in, watch the slides, and do the practical steps!

Slides

Labs

Requirements

  • git
  • docker
  • docker-compose

Installation

git clone https://github.com/HadesArchitect/CaSpark.git
cd CaSpark
docker-compose up -d
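
Once the containers are up, Jupyter should be reachable at localhost:8888. As a quick sanity check that the Cassandra container is also accepting connections, here is a minimal sketch, assuming the compose file publishes the default CQL port 9042 to the host and that the cassandra-driver Python package is installed locally (neither is required for the lab itself):

from cassandra.cluster import Cluster

# Connect to the Cassandra container on the default CQL port (assumed published).
cluster = Cluster(['127.0.0.1'], port=9042)
session = cluster.connect()
# Any trivial query proves the node is up; system.local always exists.
print(session.execute('SELECT release_version FROM system.local').one())
cluster.shutdown()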

Usage

You may need to use a custom IP address instead of localhost if you use Docker for Mac, Docker for Windows, or a similar installation.
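
In that case the Spark session also needs to know where Cassandra lives. A minimal sketch, assuming your Docker VM is reachable at 192.168.99.100 (a placeholder address; substitute your own):

from pyspark.sql import SparkSession

# spark.cassandra.connection.host tells the spark-cassandra-connector which
# node to contact; replace the placeholder with your Docker VM's IP address.
spark = SparkSession.builder \
    .appName('demo') \
    .master('local') \
    .config('spark.cassandra.connection.host', '192.168.99.100') \
    .getOrCreate()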

Known Issues

In some cases, executing the exercises may lead to memory issues, especially on weaker or non-Linux machines, due to Docker's limitations on available memory. If you have issues with exercises after the first few, try to clean up and start again: docker-compose kill && docker-compose down && docker-compose up -d. You may need to repeat the steps of the notebook you were working on.

caspark's Issues

Exception when running kmeans.ipynb

When running kmeans.ipynb, on the step:

spark = SparkSession.builder.appName('demo').master("local").getOrCreate()

socialDF = spark.read.format("org.apache.spark.sql.cassandra").options(table="socialmedia", keyspace="accelerate").load()

print ("Table Row Count: ")
print (socialDF.count())

I receive this exception:
Py4JJavaError: An error occurred while calling o38.load.
: java.lang.NoClassDefFoundError: scala/Product$class
at com.datastax.spark.connector.util.ConfigParameter.<init>(ConfigParameter.scala:7)
at com.datastax.spark.connector.rdd.ReadConf$.<init>(ReadConf.scala:33)
at com.datastax.spark.connector.rdd.ReadConf$.<clinit>(ReadConf.scala)
at org.apache.spark.sql.cassandra.DefaultSource$.<init>(DefaultSource.scala:134)
at org.apache.spark.sql.cassandra.DefaultSource$.<clinit>(DefaultSource.scala)
at org.apache.spark.sql.cassandra.DefaultSource.createRelation(DefaultSource.scala:55)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:225)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.ClassNotFoundException: scala.Product$class
at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
... 23 more

After some reading, I have found that this error usually indicates a Scala version mismatch: a spark-cassandra-connector built for one Scala version (for example 2.11) cannot be loaded by a Spark distribution built against another (for example 2.12). However, I cannot figure out how to fix this.
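
A plausible fix, offered as a sketch rather than a solution verified against this repo's images: pin a connector artifact whose Scala suffix and version match the Spark distribution inside the Jupyter image, and do it before the first SparkSession is created in the kernel. The coordinates below are an assumption and must be adjusted to your Spark/Scala versions:

from pyspark.sql import SparkSession

# Assumed coordinates: the _2.12 suffix must match Spark's Scala version and
# 3.1.0 must match the Spark line. Restart the kernel first, because
# spark.jars.packages only takes effect before the JVM is started.
spark = SparkSession.builder \
    .appName('demo') \
    .master('local') \
    .config('spark.jars.packages', 'com.datastax.spark:spark-cassandra-connector_2.12:3.1.0') \
    .getOrCreate()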

Jupyter Notebook breaks while running Docker

On my PC with Ubuntu 18, the Jupyter Notebook container breaks down when running under Docker. After running

docker-compose up -d

I see this error in the logs:

jupyter_1  | Traceback (most recent call last):
jupyter_1  |   File "/opt/conda/lib/python3.8/site-packages/traitlets/traitlets.py", line 535, in get
jupyter_1  |     value = obj._trait_values[self.name]
jupyter_1  | KeyError: 'runtime_dir'
jupyter_1  | 
jupyter_1  | During handling of the above exception, another exception occurred:
jupyter_1  | 
jupyter_1  | Traceback (most recent call last):
jupyter_1  |   File "/opt/conda/bin/jupyter-notebook", line 11, in <module>
jupyter_1  |     sys.exit(main())
jupyter_1  |   File "/opt/conda/lib/python3.8/site-packages/jupyter_core/application.py", line 270, in launch_instance
jupyter_1  |     return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
jupyter_1  |   File "/opt/conda/lib/python3.8/site-packages/traitlets/config/application.py", line 836, in launch_instance
jupyter_1  |     app.initialize(argv)
jupyter_1  |   File "/opt/conda/lib/python3.8/site-packages/traitlets/config/application.py", line 86, in inner
jupyter_1  |     return method(app, *args, **kwargs)
jupyter_1  |   File "/opt/conda/lib/python3.8/site-packages/notebook/notebookapp.py", line 2034, in initialize
jupyter_1  |     self.init_configurables()
jupyter_1  |   File "/opt/conda/lib/python3.8/site-packages/notebook/notebookapp.py", line 1563, in init_configurables
jupyter_1  |     connection_dir=self.runtime_dir,
jupyter_1  |   File "/opt/conda/lib/python3.8/site-packages/traitlets/traitlets.py", line 575, in __get__
jupyter_1  |     return self.get(obj, cls)
jupyter_1  |   File "/opt/conda/lib/python3.8/site-packages/traitlets/traitlets.py", line 538, in get
jupyter_1  |     default = obj.trait_defaults(self.name)
jupyter_1  |   File "/opt/conda/lib/python3.8/site-packages/traitlets/traitlets.py", line 1577, in trait_defaults
jupyter_1  |     return self._get_trait_default_generator(names[0])(self)
jupyter_1  |   File "/opt/conda/lib/python3.8/site-packages/jupyter_core/application.py", line 100, in _runtime_dir_default
jupyter_1  |     ensure_dir_exists(rd, mode=0o700)
jupyter_1  |   File "/opt/conda/lib/python3.8/site-packages/jupyter_core/utils/__init__.py", line 13, in ensure_dir_exists
jupyter_1  |     os.makedirs(path, mode=mode)
jupyter_1  |   File "/opt/conda/lib/python3.8/os.py", line 213, in makedirs
jupyter_1  |     makedirs(head, exist_ok=exist_ok)
jupyter_1  |   File "/opt/conda/lib/python3.8/os.py", line 213, in makedirs
jupyter_1  |     makedirs(head, exist_ok=exist_ok)
jupyter_1  |   File "/opt/conda/lib/python3.8/os.py", line 213, in makedirs
jupyter_1  |     makedirs(head, exist_ok=exist_ok)
jupyter_1  |   File "/opt/conda/lib/python3.8/os.py", line 223, in makedirs
jupyter_1  |     mkdir(name, mode)
jupyter_1  | PermissionError: [Errno 13] Permission denied: '/home/jovyan/.local'
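
This PermissionError usually means that the unprivileged jovyan user (UID 1000) inside a stock jupyter/docker-stacks image cannot write to a mounted home directory. Two common remedies, suggested here without having been verified against this repo's setup: chown the bind-mounted directory to UID 1000 on the host, or start the container as root with the image's documented CHOWN_HOME=yes environment option so ownership is corrected at startup.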

`Py4JJavaError` while running kmeans.ipynb through docker-compose

While executing this particular line in kmeans.ipynb, the following exception is thrown.

socialDF = spark.read.format("org.apache.spark.sql.cassandra").options(table="socialmedia", keyspace="accelerate").load()

Exception message:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
/tmp/ipykernel_31/421172659.py in <module>
----> 1 socialDF = spark.read.format("org.apache.spark.sql.cassandra").options(table="socialmedia", keyspace="accelerate").load()

/usr/local/spark/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
    162             return self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))
    163         else:
--> 164             return self._df(self._jreader.load())
    165 
    166     def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,

/usr/local/spark/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1307 
   1308         answer = self.gateway_client.send_command(command)
-> 1309         return_value = get_return_value(
   1310             answer, self.gateway_client, self.target_id, self.name)
   1311 

/usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    109     def deco(*a, **kw):
    110         try:
--> 111             return f(*a, **kw)
    112         except py4j.protocol.Py4JJavaError as e:
    113             converted = convert_exception(e.java_exception)

/usr/local/spark/python/lib/py4j-0.10.9.2-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325             if answer[1] == REFERENCE_TYPE:
--> 326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
    328                     format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling o67.load.
: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.sql.cassandra.DefaultSource$
	at org.apache.spark.sql.cassandra.DefaultSource.createRelation(DefaultSource.scala:55)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:829)

What I did:

  1. Start the services with docker-compose
  2. Go to localhost:8888
  3. Run the whole kmeans.ipynb
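
This looks like the same root cause as the first issue above: DefaultSource$ fails static initialization when the connector jar does not match Spark's Scala version. A quick diagnostic sketch to run in a notebook cell, assuming only an existing SparkSession named spark:

# Print the Spark version and any connector packages the session was started
# with, to compare against the connector's supported Spark/Scala versions.
print(spark.version)
print(spark.sparkContext.getConf().get('spark.jars.packages', 'not set'))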
