vericast / spylon

Utilities to work with Scala/Java code with py4j

Home Page: https://maxpoint.github.io/spylon/

License: BSD 3-Clause "New" or "Revised" License

Languages: Python 100.00%
Topics: scala, python, apache-spark, team-platform

spylon's People

Contributors

ericdill, gforsyth, kichik, mariusvniekerk, markscheevel, parente, rgbkrk

spylon's Issues

Multiple ways to set configs leads to madness

    def test_config_priority():
        c = sparklauncher.SparkConfiguration()
        c.driver_memory = "4g"
        c.conf.spark.driver.memory = "5g"
        c._set_environment_variables()
>       assert '--driver-memory 5g' in os.environ['PYSPARK_SUBMIT_ARGS']
E       AssertionError: assert '--driver-memory 5g' in '--archives /path/to/some/archive.zip,/path/to/someother/archive.zip --driver-memory 4g pyspark-shell'

tests/test_spark_launcher.py:52: AssertionError

It seems that setting any of the special-cased SparkConfiguration._spark_launcher_arg_names in tandem with the fully dotted variant leads to the fully dotted value being superseded by the special-cased one. Yuck.

Daemonize the progress thread

As it stands, if the main thread throws an exception and wants to exit, the progress thread keeps the process alive because it's a non-daemon thread.

This is a simple thread.daemon=True change in progress.py, but I want to test to make sure it's not going to have other unintended side effects.
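
A minimal sketch of the change being described, assuming the reporter is started as a plain threading.Thread (the function and variable names here are illustrative, not the actual progress.py internals):

    import threading

    def start_progress_thread(report_fn, interval=1.0):
        # Illustrative only: run the progress reporter as a daemon thread so it
        # no longer keeps the process alive once the main thread exits.
        stop_event = threading.Event()

        def _run():
            while not stop_event.is_set():
                report_fn()
                stop_event.wait(interval)

        t = threading.Thread(target=_run, name="spark-progress")
        t.daemon = True  # the one-line change proposed above
        t.start()
        return t, stop_event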

stop in run iteration function

import breeze.linalg._
var a = DenseVector(1, 5, 3, 7, 4, 2, 7, 8, 2, 4, 6, 9, 3, 3, 76, 8)
def dif(X: DenseVector[Int], k: Int) { if (k == 0) println(X) else dif(X(1 until X.size) - X(0 until X.size - 1), k - 1) }
dif(a, 5)

It stops at the third statement.

Application stop / start partially works

For my spylon notebook I:

  1. Did mp.sparkSession.stop()
  2. Did not restart the notebook kernel
  3. Re-ran %%init_spark and spark to start up a Spark application again

What I found was that most operations work, like reading datasets with the sparkSession and showing them.

However, when I tried to use the sparkContext, it believes it has been stopped. Here's the code I was running and the error:

val bRetailersList = (mp.sparkSession.sparkContext
.broadcast(trainedModel.itemFactors.select("id")
.rdd.map(x => x(0).asInstanceOf[Int]).collect)
)

java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.
This stopped SparkContext was created at:

org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
java.lang.reflect.Constructor.newInstance(Constructor.java:423)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:236)
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
py4j.GatewayConnection.run(GatewayConnection.java:214)
java.lang.Thread.run(Thread.java:745)

The currently active SparkContext was created at:

org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
org.apache.spark.ml.util.BaseReadWrite$class.sparkSession(ReadWrite.scala:69)
org.apache.spark.ml.util.MLReader.sparkSession(ReadWrite.scala:189)
org.apache.spark.ml.util.BaseReadWrite$class.sc(ReadWrite.scala:80)
org.apache.spark.ml.util.MLReader.sc(ReadWrite.scala:189)
org.apache.spark.ml.recommendation.ALSModel$ALSModelReader.load(ALS.scala:317)
org.apache.spark.ml.recommendation.ALSModel$ALSModelReader.load(ALS.scala:311)
org.apache.spark.ml.util.MLReadable$class.load(ReadWrite.scala:227)
org.apache.spark.ml.recommendation.ALSModel$.load(ALS.scala:297)
(:53)
(:58)
(:60)
(:62)
(:64)
(:66)
(:68)
(:70)
(:72)
(:74)
(:76)

at org.apache.spark.SparkContext.assertNotStopped(SparkContext.scala:101)
at org.apache.spark.sql.SparkSession.<init>(SparkSession.scala:80)
at org.apache.spark.sql.SparkSession.<init>(SparkSession.scala:77)
at org.apache.spark.sql.hive.maxpoint.PlatformSparkSession.<init>(PlatformSparkSession.scala:12)
at com.maxpoint.platform.spark.SparkAppSession.createSparkSession(SparkAppSession.java:344)
at com.maxpoint.platform.das.DAS.createSparkSession(DAS.java:105)
at com.maxpoint.platform.iSparkPlatform.sparkSession$lzycompute(iSparkPlatform.scala:21)
at com.maxpoint.platform.iSparkPlatform.sparkSession(iSparkPlatform.scala:21)
... 44 elided

doctr failure

@mariusvniekerk any idea how to fix this doctr failure? https://travis-ci.org/maxpoint/spylon/jobs/235685518

Successfully built doctr cryptography pycparser
Installing collected packages: idna, asn1crypto, pyparsing, packaging, pycparser, cffi, cryptography, doctr
Successfully installed asn1crypto-0.22.0 cffi-1.10.0 cryptography-1.8.1 doctr-1.5.3 idna-2.5 packaging-16.8 pycparser-2.17 pyparsing-2.2.0
Setting git attributes
git config --global user.name 'Doctr (Travis CI)'
git config --global user.email [email protected]
Adding doctr remote
ssh-add /home/travis/.ssh/github_deploy_key
Identity added: /home/travis/.ssh/github_deploy_key (/home/travis/.ssh/github_deploy_key)
git remote add doctr_remote git@github.com:maxpoint/spylon.git
Fetching doctr remote
git fetch doctr_remote
Warning: Permanently added the RSA host key for IP address '192.30.253.112' to the list of known hosts.
From github.com:maxpoint/spylon
 * [new branch]      gh-pages   -> doctr_remote/gh-pages
 * [new branch]      master     -> doctr_remote/master
Checking out gh-pages
git checkout -b gh-pages --track doctr_remote/gh-pages
error: Your local changes to the following files would be overwritten by checkout:
	docs/index.rst
Please, commit your changes or stash them before you can switch branches.
Aborting
M	docs/index.rst
HEAD is now at 6fbb79a... Merge pull request #47 from ericdill/issue-46
The command "if [[ $TRAVIS_PYTHON_VERSION = "3.5" && $TRAVIS_BRANCH = "master" && $TRAVIS_PULL_REQUEST = "false" ]]; then pip install doctr; doctr deploy docs/_build/html; fi;" exited with 1.
Done. Your build exited with 1.

Split "SparkConfiguration" into launcher and runtime configs

It's pretty confusing that the configuration parameters required for launching the Spark context are mixed in with the parameters that can be tweaked while the context is active. Splitting them is probably an API break, so it needs to be considered carefully.
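
A rough sketch of one possible shape for the split; every class and attribute name below is hypothetical, not the existing spylon API:

    class LauncherConfiguration(object):
        # Hypothetical: options that must be fixed before the JVM starts and
        # end up in PYSPARK_SUBMIT_ARGS (driver memory, archives, ...).
        def __init__(self):
            self.driver_memory = None
            self.archives = []

    class RuntimeConfiguration(object):
        # Hypothetical: options that can still be changed on a live context,
        # applied via SparkConf / spark.conf.set.
        def __init__(self):
            self.settings = {}

    class SparkConfiguration(object):
        # Hypothetical facade that keeps a single entry point but makes the
        # launcher/runtime distinction explicit.
        def __init__(self):
            self.launcher = LauncherConfiguration()
            self.runtime = RuntimeConfiguration()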

[request] provide example for yarn configuration

Please provide some examples of YARN cluster connectivity.

Use case:
I want to connect to an EMR cluster running Spark in YARN mode, on a public VPC/subnet that allows SSH (port 22) connections; other ports can be opened when necessary.

Thanks.
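
A minimal sketch of the kind of example being requested, assuming the client machine has the cluster's Hadoop/YARN configuration files available locally (the HADOOP_CONF_DIR path is a placeholder and master is an assumed attribute; the rest mirrors the configuration style shown in the tests elsewhere on this page):

    import os
    import spylon.spark.launcher as sparklauncher

    # Illustrative only: point Spark at the cluster's YARN/HDFS config files,
    # e.g. copied from the EMR master node over SSH (port 22).
    os.environ["HADOOP_CONF_DIR"] = "/path/to/emr/hadoop-conf"

    c = sparklauncher.SparkConfiguration()
    c.master = "yarn"                      # assumed attribute: submit in YARN client mode
    c.driver_memory = "4g"
    c.conf.spark.executor.memory = "4g"    # dotted conf access, as in the tests above
    c.conf.spark.executor.instances = "4"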

Endless progress bars for endless (but complete) spark jobs

There have been cases where the Spark StatusTracker believes a job is still in progress when in fact it has completed and returned results to the driver. When this happens (and surely it's some kind of Spark bug that it does), the progress thread continually outputs useless progress information.

It's not clear how to filter out these dead jobs. Minimally, we should consider having a way of turning the progress bars off and on at runtime without hacking sc._jsc = None to get the thread to quit. This is clearly a stopgap: we should figure out what's causing the bug and get it fixed upstream.
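
A minimal sketch of the runtime on/off toggle suggested above, assuming the reporter loop is something we control; the function names are hypothetical, not the current progress.py API:

    import threading
    import time

    _progress_enabled = threading.Event()
    _progress_enabled.set()  # bars on by default

    def disable_progress_bars():
        # Hypothetical toggle: mute the reporter without killing the thread
        # (and without resorting to sc._jsc = None).
        _progress_enabled.clear()

    def enable_progress_bars():
        _progress_enabled.set()

    def _report_loop(poll_and_render, interval=1.0):
        # The daemon reporter keeps polling, but only renders output while the
        # toggle is set, so a job Spark wrongly reports as "in progress" can be
        # silenced at runtime.
        while True:
            if _progress_enabled.is_set():
                poll_and_render()
            time.sleep(interval)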

Create simpler way to load a scala class

Due to how we need to make use of the Spark class loader, there are some pieces that are a little more hidden than desired.

jvm_helpers.classloader.loadClass("com.MyClass")

returns a JavaObject rather than a py4j JavaClass

See if we can make something that makes this a bit easier.
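
A possible helper sketch, assuming the class loader is reached through the SparkContext's py4j gateway; load_class and the attribute paths used here are illustrative, not an existing spylon function:

    from py4j.java_gateway import JavaClass

    def load_class(sc, fqn):
        # Illustrative helper: resolve `fqn` through the Spark context class
        # loader, then hand back a py4j JavaClass proxy instead of the raw
        # java.lang.Class JavaObject that loadClass returns.
        loader = sc._jvm.java.lang.Thread.currentThread().getContextClassLoader()
        loader.loadClass(fqn)
        return JavaClass(fqn, sc._gateway._gateway_client)

Note that the JavaClass proxy still resolves members through py4j's reflection on the JVM side, so this only helps if the class is also visible to that class loader.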

Outdated versioneer.py broken for Python 3.12

Python 3.12 removed the long-deprecated configparser.SafeConfigParser class, which old versions of versioneer depended on.

Expected Behavior

pip install spylon-kernel correctly installs the package.

Current Behavior

pip install spylon-kernel fails with this output:

...
              File "/tmp/pip-install-7tqi8u7z/spylon_2393384d9e244f009732dc4b8192d5fb/versioneer.py", line 412, in get_config_from_root
                parser = configparser.SafeConfigParser()
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
            AttributeError: module 'configparser' has no attribute 'SafeConfigParser'. Did you mean: 'RawConfigParser'?
            [end of output]
        
        note: This error originates from a subprocess, and is likely not a problem with pip.
      error: subprocess-exited-with-error

      × Getting requirements to build wheel did not run successfully.
      │ exit code: 1
      ╰─> See above for output.

Steps to Reproduce

Run pip install spylon-kernel using Python 3.12 or higher.

Detailed Description

SafeConfigParser had been deprecated since Python 3.2 and renamed to simply ConfigParser. In Python 3.12 it was finally removed.
The versioneer.py file in this repository has been generated with versioneer version 0.17, which generated code that uses SafeConfigParser.
New versions of versioneer use ConfigParser instead, so generating a new versioneer.py with a more recent version fixes the issue.

See: https://docs.python.org/3/whatsnew/3.2.html#configparser
and: https://docs.python.org/3/whatsnew/3.12.html#removed

Possible Solution

Regenerate versioneer.py with a more recent versioneer release.
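
For reference, a sketch of the one-line difference involved; the actual fix is to regenerate versioneer.py rather than patch it by hand:

    import configparser

    # versioneer 0.17 generated:
    #     parser = configparser.SafeConfigParser()
    # which no longer exists on Python 3.12+. The replacement, available since
    # Python 3.2, is:
    parser = configparser.ConfigParser()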

Test state is leaking between tests

With this test file test_leaky_state.py:

import spylon.spark.launcher as sparklauncher
import os

def test_set_spark_property():
    c = sparklauncher.SparkConfiguration()
    c.driver_memory = "4g"

def test_spark_driver_memory():
    c = sparklauncher.SparkConfiguration()
    c.conf.spark.driver.memory = "5g"
    c._set_environment_variables()
    assert '--driver-memory 5g' in os.environ['PYSPARK_SUBMIT_ARGS']

If I run test_spark_driver_memory by itself, it passes just fine:

$ python run_tests.py tests/test_leaky_state.py -k driver -v
========================================================================================================================================================================= test session starts =========================================================================================================================================================================
platform darwin -- Python 3.5.3, pytest-3.0.7, py-1.4.33, pluggy-0.4.0
rootdir: /Users/edill/dev/maxpoint/github/spylon, inifile:
collected 2 items 

tests/test_leaky_state.py::test_spark_driver_memory PASSED

========================================================================================================================================================================= 1 tests deselected ==========================================================================================================================================================================
=============================================================================================================================================================== 1 passed, 1 deselected in 0.29 seconds ================================================================================================================================================================
Name                            Stmts   Miss  Cover
---------------------------------------------------
spylon/__init__.py                  4      0   100%
spylon/_version.py                261    143    45%
spylon/common.py                   62     40    35%
spylon/spark/__init__.py            4      0   100%
spylon/spark/launcher.py          286    126    56%
spylon/spark/progress.py           76     65    14%
spylon/spark/yarn_launcher.py     144    119    17%
tests/test_leaky_state.py          10      2    80%
---------------------------------------------------
TOTAL                             847    495    42%

But if I run the two of them together, with test_set_spark_property running first, then test_spark_driver_memory fails:

$ python run_tests.py tests/test_leaky_state.py -v
========================================================================================================================================================================= test session starts =========================================================================================================================================================================
platform darwin -- Python 3.5.3, pytest-3.0.7, py-1.4.33, pluggy-0.4.0 -- /Users/edill/miniconda/envs/spylon/bin/python
cachedir: .cache
rootdir: /Users/edill/dev/maxpoint/github/spylon, inifile:
collected 2 items 

tests/test_leaky_state.py::test_set_spark_property PASSED
tests/test_leaky_state.py::test_spark_driver_memory FAILED

============================================================================================================================================================================== FAILURES ===============================================================================================================================================================================
______________________________________________________________________________________________________________________________________________________________________ test_spark_driver_memory _______________________________________________________________________________________________________________________________________________________________________

    def test_spark_driver_memory():
        c = sparklauncher.SparkConfiguration()
        c.conf.spark.driver.memory = "5g"
        c._set_environment_variables()
>       assert '--driver-memory 5g' in os.environ['PYSPARK_SUBMIT_ARGS']
E       AssertionError: assert '--driver-memory 5g' in '--driver-memory 4g pyspark-shell'

tests/test_leaky_state.py:12: AssertionError
================================================================================================================================================================= 1 failed, 1 passed in 0.34 seconds ==================================================================================================================================================================
Name                            Stmts   Miss  Cover
---------------------------------------------------
spylon/__init__.py                  4      0   100%
spylon/_version.py                261    143    45%
spylon/common.py                   62     40    35%
spylon/spark/__init__.py            4      0   100%
spylon/spark/launcher.py          286    124    57%
spylon/spark/progress.py           76     65    14%
spylon/spark/yarn_launcher.py     144    119    17%
tests/test_leaky_state.py          10      0   100%
---------------------------------------------------
TOTAL                             847    491    42%
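
Not a fix for the root cause, but a sketch of a stopgap conftest fixture that would isolate the tests in the meantime; the fixture name is illustrative, and it assumes the leak surfaces through module/class-level state in the launcher plus PYSPARK_SUBMIT_ARGS:

    # conftest.py
    import importlib
    import os

    import pytest

    import spylon.spark.launcher as sparklauncher

    @pytest.fixture(autouse=True)
    def isolate_spark_configuration():
        # Reload the launcher module so any class-level defaults mutated by a
        # previous test are rebuilt, and clear the env var both before and
        # after each test.
        os.environ.pop("PYSPARK_SUBMIT_ARGS", None)
        importlib.reload(sparklauncher)
        yield
        os.environ.pop("PYSPARK_SUBMIT_ARGS", None)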
