vericast / spylon

Utilities to work with Scala/Java code with py4j

Home Page: https://maxpoint.github.io/spylon/

License: BSD 3-Clause "New" or "Revised" License

Languages: Python 100.00%
Topics: scala, python, apache-spark, team-platform

spylon's People

Contributors

ericdill, gforsyth, kichik, mariusvniekerk, markscheevel, parente, rgbkrk

spylon's Issues

Multiple ways to set configs leads to madness

    def test_config_priority():
        c = sparklauncher.SparkConfiguration()
        c.driver_memory = "4g"
        c.conf.spark.driver.memory = "5g"
        c._set_environment_variables()
>       assert '--driver-memory 5g' in os.environ['PYSPARK_SUBMIT_ARGS']
E       AssertionError: assert '--driver-memory 5g' in '--archives /path/to/some/archive.zip,/path/to/someother/archive.zip --driver-memory 4g pyspark-shell'

tests/test_spark_launcher.py:52: AssertionError

It seems that setting any of the special-cased SparkConfiguration._spark_launcher_arg_names in tandem with the fully dotted variant leads to the fully dotted value being superseded by the special-cased one. Yuck.

Daemonize the progress thread

As it stands, if the main thread throws an exception and wants to exit, the progress thread keeps the process alive because it's a non-daemon thread.

This is a simple thread.daemon=True change in progress.py, but I want to test to make sure it's not going to have other unintended side effects.
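
A minimal sketch of the change being described, assuming the reporter is started as a plain threading.Thread (the function and variable names here are illustrative, not the actual progress.py internals):

    import threading

    def start_progress_thread(report_fn, interval=1.0):
        # Illustrative only: run the progress reporter as a daemon thread so it
        # no longer keeps the process alive once the main thread exits.
        stop_event = threading.Event()

        def _run():
            while not stop_event.is_set():
                report_fn()
                stop_event.wait(interval)

        t = threading.Thread(target=_run, name="spark-progress")
        t.daemon = True  # the one-line change proposed above
        t.start()
        return t, stop_event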

stop in run iteration function

import breeze.linalg._
var a = DenseVector(1, 5, 3, 7, 4, 2, 7, 8, 2, 4, 6, 9, 3, 3, 76, 8)
def dif(X: DenseVector[Int], k: Int) { if (k == 0) println(X) else dif(X(1 until X.size) - X(0 until X.size - 1), k - 1) }
dif(a, 5)

It stops at the third statement.

Application stop / start partially works

For my spylon notebook I:

  1. Did mp.sparkSession.stop()
  2. Did not restart the notebook kernel
  3. Re-ran %%init_spark and spark to start up a Spark application again

What I found was that most operations work, like reading datasets with the sparkSession and showing them.

However, when I tried to use the sparkContext, it believes it has been stopped. Here's the code I was running and the error:

val bRetailersList = (mp.sparkSession.sparkContext
.broadcast(trainedModel.itemFactors.select("id")
.rdd.map(x => x(0).asInstanceOf[Int]).collect)
)

java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.
This stopped SparkContext was created at:

org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
java.lang.reflect.Constructor.newInstance(Constructor.java:423)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:236)
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
py4j.GatewayConnection.run(GatewayConnection.java:214)
java.lang.Thread.run(Thread.java:745)

The currently active SparkContext was created at:

org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
org.apache.spark.ml.util.BaseReadWrite$class.sparkSession(ReadWrite.scala:69)
org.apache.spark.ml.util.MLReader.sparkSession(ReadWrite.scala:189)
org.apache.spark.ml.util.BaseReadWrite$class.sc(ReadWrite.scala:80)
org.apache.spark.ml.util.MLReader.sc(ReadWrite.scala:189)
org.apache.spark.ml.recommendation.ALSModel$ALSModelReader.load(ALS.scala:317)
org.apache.spark.ml.recommendation.ALSModel$ALSModelReader.load(ALS.scala:311)
org.apache.spark.ml.util.MLReadable$class.load(ReadWrite.scala:227)
org.apache.spark.ml.recommendation.ALSModel$.load(ALS.scala:297)
(:53)
(:58)
(:60)
(:62)
(:64)
(:66)
(:68)
(:70)
(:72)
(:74)
(:76)

at org.apache.spark.SparkContext.assertNotStopped(SparkContext.scala:101)
at org.apache.spark.sql.SparkSession.<init>(SparkSession.scala:80)
at org.apache.spark.sql.SparkSession.<init>(SparkSession.scala:77)
at org.apache.spark.sql.hive.maxpoint.PlatformSparkSession.<init>(PlatformSparkSession.scala:12)
at com.maxpoint.platform.spark.SparkAppSession.createSparkSession(SparkAppSession.java:344)
at com.maxpoint.platform.das.DAS.createSparkSession(DAS.java:105)
at com.maxpoint.platform.iSparkPlatform.sparkSession$lzycompute(iSparkPlatform.scala:21)
at com.maxpoint.platform.iSparkPlatform.sparkSession(iSparkPlatform.scala:21)
... 44 elided

doctr failure

@mariusvniekerk any idea how to fix this doctr failure? https://travis-ci.org/maxpoint/spylon/jobs/235685518

Successfully built doctr cryptography pycparser
Installing collected packages: idna, asn1crypto, pyparsing, packaging, pycparser, cffi, cryptography, doctr
Successfully installed asn1crypto-0.22.0 cffi-1.10.0 cryptography-1.8.1 doctr-1.5.3 idna-2.5 packaging-16.8 pycparser-2.17 pyparsing-2.2.0
Setting git attributes
git config --global user.name 'Doctr (Travis CI)'
git config --global user.email [email protected]
Adding doctr remote
ssh-add /home/travis/.ssh/github_deploy_key
Identity added: /home/travis/.ssh/github_deploy_key (/home/travis/.ssh/github_deploy_key)
git remote add doctr_remote git@github.com:maxpoint/spylon.git
Fetching doctr remote
git fetch doctr_remote
Warning: Permanently added the RSA host key for IP address '192.30.253.112' to the list of known hosts.
From github.com:maxpoint/spylon
 * [new branch]      gh-pages   -> doctr_remote/gh-pages
 * [new branch]      master     -> doctr_remote/master
Checking out gh-pages
git checkout -b gh-pages --track doctr_remote/gh-pages
error: Your local changes to the following files would be overwritten by checkout:
	docs/index.rst
Please, commit your changes or stash them before you can switch branches.
Aborting
M	docs/index.rst
HEAD is now at 6fbb79a... Merge pull request #47 from ericdill/issue-46
The command "if [[ $TRAVIS_PYTHON_VERSION = "3.5" && $TRAVIS_BRANCH = "master" && $TRAVIS_PULL_REQUEST = "false" ]]; then pip install doctr; doctr deploy docs/_build/html; fi;" exited with 1.
Done. Your build exited with 1.

Split "SparkConfiguration" into launcher and runtime configs

It's pretty confusing that the configuration parameters required for launching the Spark context are mixed in with the parameters that can be tweaked while the context is active. Splitting them is probably an API break, so it needs to be considered carefully.
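
A rough sketch of one possible shape for the split; every class and attribute name below is hypothetical, not the existing spylon API:

    class LauncherConfiguration(object):
        # Hypothetical: options that must be fixed before the JVM starts and
        # end up in PYSPARK_SUBMIT_ARGS (driver memory, archives, ...).
        def __init__(self):
            self.driver_memory = None
            self.archives = []

    class RuntimeConfiguration(object):
        # Hypothetical: options that can still be changed on a live context,
        # applied via SparkConf / spark.conf.set.
        def __init__(self):
            self.settings = {}

    class SparkConfiguration(object):
        # Hypothetical facade that keeps a single entry point but makes the
        # launcher/runtime distinction explicit.
        def __init__(self):
            self.launcher = LauncherConfiguration()
            self.runtime = RuntimeConfiguration()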

[request] provide example for yarn configuration

Please provide some examples of YARN cluster connectivity.

Use case:
I want to connect to an EMR cluster running Spark in YARN mode, on a public VPC/subnet that allows SSH (port 22) connections; other ports can be opened when necessary.

Thanks.
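
A minimal sketch of the kind of example being requested, assuming the client machine has the cluster's Hadoop/YARN configuration files available locally (the HADOOP_CONF_DIR path is a placeholder and master is an assumed attribute; the rest mirrors the configuration style shown in the tests elsewhere on this page):

    import os
    import spylon.spark.launcher as sparklauncher

    # Illustrative only: point Spark at the cluster's YARN/HDFS config files,
    # e.g. copied from the EMR master node over SSH (port 22).
    os.environ["HADOOP_CONF_DIR"] = "/path/to/emr/hadoop-conf"

    c = sparklauncher.SparkConfiguration()
    c.master = "yarn"                      # assumed attribute: submit in YARN client mode
    c.driver_memory = "4g"
    c.conf.spark.executor.memory = "4g"    # dotted conf access, as in the tests above
    c.conf.spark.executor.instances = "4"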

Endless progress bars for endless (but complete) spark jobs

There have been cases where the Spark StatusTracker believes a job is still in progress when in fact it has completed and returned results to the driver. When this happens (and surely it's some kind of Spark bug that it does), the progress thread continually outputs useless progress information.

It's not clear how to filter out these dead jobs. Minimally, we should consider having a way of turning the progress bars off and on at runtime without hacking sc._jsc = None to get the thread to quit. This is clearly a stopgap: we should figure out what's causing the bug and get it fixed upstream.
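
A minimal sketch of the runtime on/off toggle suggested above, assuming the reporter loop is something we control; the function names are hypothetical, not the current progress.py API:

    import threading
    import time

    _progress_enabled = threading.Event()
    _progress_enabled.set()  # bars on by default

    def disable_progress_bars():
        # Hypothetical toggle: mute the reporter without killing the thread
        # (and without resorting to sc._jsc = None).
        _progress_enabled.clear()

    def enable_progress_bars():
        _progress_enabled.set()

    def _report_loop(poll_and_render, interval=1.0):
        # The daemon reporter keeps polling, but only renders output while the
        # toggle is set, so a job Spark wrongly reports as "in progress" can be
        # silenced at runtime.
        while True:
            if _progress_enabled.is_set():
                poll_and_render()
            time.sleep(interval)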

Create simpler way to load a scala class

Due to how we need to make use of the Spark class loader, there are some pieces that are a little more hidden than desired.

jvm_helpers.classloader.loadClass("com.MyClass")

returns a JavaObject rather than a py4j JavaClass

See if we can make something that makes this a bit easier.
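
A possible helper sketch, assuming the class loader is reached through the SparkContext's py4j gateway; load_class and the attribute paths used here are illustrative, not an existing spylon function:

    from py4j.java_gateway import JavaClass

    def load_class(sc, fqn):
        # Illustrative helper: resolve `fqn` through the Spark context class
        # loader, then hand back a py4j JavaClass proxy instead of the raw
        # java.lang.Class JavaObject that loadClass returns.
        loader = sc._jvm.java.lang.Thread.currentThread().getContextClassLoader()
        loader.loadClass(fqn)
        return JavaClass(fqn, sc._gateway._gateway_client)

Note that the JavaClass proxy still resolves members through py4j's reflection on the JVM side, so this only helps if the class is also visible to that class loader.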

Outdated versioneer.py broken for Python 3.12

Python 3.12 removed the long-deprecated configparser.SafeConfigParser class, which old versions of versioneer depended on.

Expected Behavior

pip install spylon-kernel correctly installs the package.

Current Behavior

pip install spylon-kernel fails with this output:

...
              File "/tmp/pip-install-7tqi8u7z/spylon_2393384d9e244f009732dc4b8192d5fb/versioneer.py", line 412, in get_config_from_root
                parser = configparser.SafeConfigParser()
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
            AttributeError: module 'configparser' has no attribute 'SafeConfigParser'. Did you mean: 'RawConfigParser'?
            [end of output]
        
        note: This error originates from a subprocess, and is likely not a problem with pip.
      error: subprocess-exited-with-error

      × Getting requirements to build wheel did not run successfully.
      │ exit code: 1
      ╰─> See above for output.

Steps to Reproduce

Run pip install spylon-kernel using Python 3.12 or higher.

Detailed Description

SafeConfigParser had been deprecated since Python 3.2 and renamed to simply ConfigParser. In Python 3.12 it was finally removed.
The versioneer.py file in this repository has been generated with versioneer version 0.17, which generated code that uses SafeConfigParser.
New versions of versioneer use ConfigParser instead, so generating a new versioneer.py with a more recent version fixes the issue.

See: https://docs.python.org/3/whatsnew/3.2.html#configparser
and: https://docs.python.org/3/whatsnew/3.12.html#removed

Possible Solution

Regenerate versioneer.py with a more recent versioneer release.
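
For reference, a sketch of the one-line difference involved; the actual fix is to regenerate versioneer.py rather than patch it by hand:

    import configparser

    # versioneer 0.17 generated:
    #     parser = configparser.SafeConfigParser()
    # which no longer exists on Python 3.12+. The replacement, available since
    # Python 3.2, is:
    parser = configparser.ConfigParser()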

Test state is leaking between tests

With this test file test_leaky_state.py:

import spylon.spark.launcher as sparklauncher
import os

def test_set_spark_property():
    c = sparklauncher.SparkConfiguration()
    c.driver_memory = "4g"

def test_spark_driver_memory():
    c = sparklauncher.SparkConfiguration()
    c.conf.spark.driver.memory = "5g"
    c._set_environment_variables()
    assert '--driver-memory 5g' in os.environ['PYSPARK_SUBMIT_ARGS']

If I run test_spark_driver_memory by itself, it passes just fine:

$ python run_tests.py tests/test_leaky_state.py -k driver -v
========================================================================================================================================================================= test session starts =========================================================================================================================================================================
platform darwin -- Python 3.5.3, pytest-3.0.7, py-1.4.33, pluggy-0.4.0
rootdir: /Users/edill/dev/maxpoint/github/spylon, inifile:
collected 2 items 

tests/test_leaky_state.py::test_spark_driver_memory PASSED

========================================================================================================================================================================= 1 tests deselected ==========================================================================================================================================================================
=============================================================================================================================================================== 1 passed, 1 deselected in 0.29 seconds ================================================================================================================================================================
Name                            Stmts   Miss  Cover
---------------------------------------------------
spylon/__init__.py                  4      0   100%
spylon/_version.py                261    143    45%
spylon/common.py                   62     40    35%
spylon/spark/__init__.py            4      0   100%
spylon/spark/launcher.py          286    126    56%
spylon/spark/progress.py           76     65    14%
spylon/spark/yarn_launcher.py     144    119    17%
tests/test_leaky_state.py          10      2    80%
---------------------------------------------------
TOTAL                             847    495    42%

But if I run the two of them together, with test_set_spark_property running first, then test_spark_driver_memory fails:

$ python run_tests.py tests/test_leaky_state.py -v
========================================================================================================================================================================= test session starts =========================================================================================================================================================================
platform darwin -- Python 3.5.3, pytest-3.0.7, py-1.4.33, pluggy-0.4.0 -- /Users/edill/miniconda/envs/spylon/bin/python
cachedir: .cache
rootdir: /Users/edill/dev/maxpoint/github/spylon, inifile:
collected 2 items 

tests/test_leaky_state.py::test_set_spark_property PASSED
tests/test_leaky_state.py::test_spark_driver_memory FAILED

============================================================================================================================================================================== FAILURES ===============================================================================================================================================================================
______________________________________________________________________________________________________________________________________________________________________ test_spark_driver_memory _______________________________________________________________________________________________________________________________________________________________________

    def test_spark_driver_memory():
        c = sparklauncher.SparkConfiguration()
        c.conf.spark.driver.memory = "5g"
        c._set_environment_variables()
>       assert '--driver-memory 5g' in os.environ['PYSPARK_SUBMIT_ARGS']
E       AssertionError: assert '--driver-memory 5g' in '--driver-memory 4g pyspark-shell'

tests/test_leaky_state.py:12: AssertionError
================================================================================================================================================================= 1 failed, 1 passed in 0.34 seconds ==================================================================================================================================================================
Name                            Stmts   Miss  Cover
---------------------------------------------------
spylon/__init__.py                  4      0   100%
spylon/_version.py                261    143    45%
spylon/common.py                   62     40    35%
spylon/spark/__init__.py            4      0   100%
spylon/spark/launcher.py          286    124    57%
spylon/spark/progress.py           76     65    14%
spylon/spark/yarn_launcher.py     144    119    17%
tests/test_leaky_state.py          10      0   100%
---------------------------------------------------
TOTAL                             847    491    42%
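
Not a fix for the root cause, but a sketch of a stopgap conftest fixture that would isolate the tests in the meantime; the fixture name is illustrative, and it assumes the leak surfaces through module/class-level state in the launcher plus PYSPARK_SUBMIT_ARGS:

    # conftest.py
    import importlib
    import os

    import pytest

    import spylon.spark.launcher as sparklauncher

    @pytest.fixture(autouse=True)
    def isolate_spark_configuration():
        # Reload the launcher module so any class-level defaults mutated by a
        # previous test are rebuilt, and clear the env var both before and
        # after each test.
        os.environ.pop("PYSPARK_SUBMIT_ARGS", None)
        importlib.reload(sparklauncher)
        yield
        os.environ.pop("PYSPARK_SUBMIT_ARGS", None)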
