vericast / spylon

Utilities to work with Scala/Java code with py4j

Home Page: https://maxpoint.github.io/spylon/
License: BSD 3-Clause "New" or "Revised" License
def test_config_priority():
    c = sparklauncher.SparkConfiguration()
    c.driver_memory = "4g"
    c.conf.spark.driver.memory = "5g"
    c._set_environment_variables()
>   assert '--driver-memory 5g' in os.environ['PYSPARK_SUBMIT_ARGS']
E   AssertionError: assert '--driver-memory 5g' in '--archives /path/to/some/archive.zip,/path/to/someother/archive.zip --driver-memory 4g pyspark-shell'

tests/test_spark_launcher.py:52: AssertionError
It seems that setting any of the special-cased SparkConfiguration._spark_launcher_arg_names in tandem with the fully dotted variable leads to the fully dotted one being superseded by the special-cased one. Yuck.
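A tiny sketch of the precedence we'd expect (helper name hypothetical; the real merging happens inside SparkConfiguration when it builds PYSPARK_SUBMIT_ARGS):

def resolve_driver_memory(launcher_value, dotted_value):
    # The fully dotted spark.driver.memory was set explicitly, so it should
    # win over the special-cased driver_memory attribute.
    return dotted_value if dotted_value is not None else launcher_value

assert resolve_driver_memory("4g", "5g") == "5g"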
Just like we can call to_scala_seq, we need a from_scala_seq, etc.
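A minimal sketch of what the helper could look like, assuming the Seq arrives as a py4j JavaObject (illustration only, not a final API):

def from_scala_seq(scala_seq):
    # A Scala Seq exposes apply(i) and size() through py4j, so we can walk
    # it element by element and collect the results into a Python list.
    return [scala_seq.apply(i) for i in range(scala_seq.size())]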
As it stands, if the main thread throws an exception and wants to exit, the progress thread keeps the process alive because it's a non-daemon thread. This is a simple thread.daemon = True change in progress.py, but I want to test to make sure it's not going to have other unintended side effects.
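For reference, a minimal sketch of the proposed change (the target function here is a stand-in for the real polling loop in progress.py):

import threading
import time

def poll_and_print_progress():  # hypothetical stand-in for the real loop
    while True:
        time.sleep(1)

progress_thread = threading.Thread(target=poll_and_print_progress)
progress_thread.daemon = True  # daemon threads don't keep the process alive
progress_thread.start()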
import breeze.linalg._

var a = DenseVector(1, 5, 3, 7, 4, 2, 7, 8, 2, 4, 6, 9, 3, 3, 76, 8)

def dif(X: DenseVector[Int], k: Int) {
  if (k == 0) println(X)
  else dif(X(1 until X.size) - X(0 until X.size - 1), k - 1)
}

dif(a, 5)
Execution stops at the third statement.
In my spylon notebook, I found that most operations work, like reading datasets using the sparkSession and showing them. However, when I tried to use the sparkContext, it thinks it's not running. Here's the code I was running and the error:
val bRetailersList = (mp.sparkSession.sparkContext
.broadcast(trainedModel.itemFactors.select("id")
.rdd.map(x => x(0).asInstanceOf[Int]).collect)
)
java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.
This stopped SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
java.lang.reflect.Constructor.newInstance(Constructor.java:423)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:236)
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
py4j.GatewayConnection.run(GatewayConnection.java:214)
java.lang.Thread.run(Thread.java:745)
The currently active SparkContext was created at:
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
org.apache.spark.ml.util.BaseReadWrite$class.sparkSession(ReadWrite.scala:69)
org.apache.spark.ml.util.MLReader.sparkSession(ReadWrite.scala:189)
org.apache.spark.ml.util.BaseReadWrite$class.sc(ReadWrite.scala:80)
org.apache.spark.ml.util.MLReader.sc(ReadWrite.scala:189)
org.apache.spark.ml.recommendation.ALSModel$ALSModelReader.load(ALS.scala:317)
org.apache.spark.ml.recommendation.ALSModel$ALSModelReader.load(ALS.scala:311)
org.apache.spark.ml.util.MLReadable$class.load(ReadWrite.scala:227)
org.apache.spark.ml.recommendation.ALSModel$.load(ALS.scala:297)
(<console>:53)
(<console>:58)
(<console>:60)
(<console>:62)
(<console>:64)
(<console>:66)
(<console>:68)
(<console>:70)
(<console>:72)
(<console>:74)
(<console>:76)
at org.apache.spark.SparkContext.assertNotStopped(SparkContext.scala:101)
at org.apache.spark.sql.SparkSession.<init>(SparkSession.scala:80)
at org.apache.spark.sql.SparkSession.<init>(SparkSession.scala:77)
at org.apache.spark.sql.hive.maxpoint.PlatformSparkSession.<init>(PlatformSparkSession.scala:12)
at com.maxpoint.platform.spark.SparkAppSession.createSparkSession(SparkAppSession.java:344)
at com.maxpoint.platform.das.DAS.createSparkSession(DAS.java:105)
at com.maxpoint.platform.iSparkPlatform.sparkSession$lzycompute(iSparkPlatform.scala:21)
at com.maxpoint.platform.iSparkPlatform.sparkSession(iSparkPlatform.scala:21)
... 44 elided
@mariusvniekerk any idea how to fix this doctr failure? https://travis-ci.org/maxpoint/spylon/jobs/235685518
Successfully built doctr cryptography pycparser
Installing collected packages: idna, asn1crypto, pyparsing, packaging, pycparser, cffi, cryptography, doctr
Successfully installed asn1crypto-0.22.0 cffi-1.10.0 cryptography-1.8.1 doctr-1.5.3 idna-2.5 packaging-16.8 pycparser-2.17 pyparsing-2.2.0
Setting git attributes
git config --global user.name 'Doctr (Travis CI)'
git config --global user.email [email protected]
Adding doctr remote
ssh-add /home/travis/.ssh/github_deploy_key
Identity added: /home/travis/.ssh/github_deploy_key (/home/travis/.ssh/github_deploy_key)
git remote add doctr_remote [email protected]:maxpoint/spylon.git
Fetching doctr remote
git fetch doctr_remote
Warning: Permanently added the RSA host key for IP address '192.30.253.112' to the list of known hosts.
From github.com:maxpoint/spylon
* [new branch] gh-pages -> doctr_remote/gh-pages
* [new branch] master -> doctr_remote/master
Checking out gh-pages
git checkout -b gh-pages --track doctr_remote/gh-pages
error: Your local changes to the following files would be overwritten by checkout:
docs/index.rst
Please, commit your changes or stash them before you can switch branches.
Aborting
M docs/index.rst
HEAD is now at 6fbb79a... Merge pull request #47 from ericdill/issue-46
The command "if [[ $TRAVIS_PYTHON_VERSION = "3.5" && $TRAVIS_BRANCH = "master" && $TRAVIS_PULL_REQUEST = "false" ]]; then pip install doctr; doctr deploy docs/_build/html; fi;" exited with 1.
Done. Your build exited with 1.
It's pretty confusing that the configuration parameters required for launching the Spark context are mixed in with the parameters that can be tweaked while the context is active. Separating them is probably an API break, so it needs to be considered carefully.
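A rough sketch of one possible shape for the split, with all names hypothetical, purely to illustrate the idea:

class LaunchConfiguration(object):
    """Settings that must be frozen before the JVM starts."""
    def __init__(self):
        self.driver_memory = "1g"  # baked into PYSPARK_SUBMIT_ARGS at launch
        self.archives = []

class RuntimeConfiguration(object):
    """Settings that can still be tweaked while the context is active."""
    def set_job_description(self, sc, description):
        # setLocalProperty is a real SparkContext API usable on a live context
        sc.setLocalProperty("spark.job.description", description)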
We would like spylon to work with Spark 3.0.0 and Scala 2.12.
Can anyone please outline the things that need to be changed or enhanced in order for it to work?
I am guessing that there is a hardcoded Maxpoint-ism in this code base with respect to the run_pyspark_yarn_cluster function. The env_dir parameter is unused in that function, and there seems to be a hardcoded CONDA variable in there. I'm guessing this is unpacked by one of our internal tools.
Please provide some examples of YARN cluster connectivity.
Use case: I want to connect to an EMR cluster running Spark in YARN mode, on a public VPC/subnet that allows SSH (port 22) connections; other ports can be opened when necessary.
Thanks.
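Not an authoritative answer, but a minimal sketch of what the setup could look like, assuming the EMR cluster's Hadoop/YARN config files have been copied locally (e.g. over the open SSH port). Only driver_memory and the dotted conf.* pattern appear elsewhere in this repo, so treat the other settings as assumptions:

import os
import spylon.spark.launcher as sparklauncher

# Assumption: Spark finds the remote YARN ResourceManager via the cluster
# config files copied down from the EMR master node.
os.environ["HADOOP_CONF_DIR"] = "/path/to/local/copy/of/emr/hadoop-conf"

c = sparklauncher.SparkConfiguration()
c.driver_memory = "4g"
c.conf.spark.master = "yarn"                # assumption: master set via dotted conf
c.conf.spark.submit.deployMode = "client"   # driver runs locally, executors on EMR
c._set_environment_variables()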
There have been cases where the Spark StatusTracker believes a job is still in progress when in fact it has completed and returned results to the driver. When this happens (and surely it's some kind of Spark bug that it does), the progress thread continually outputs useless progress information.
It's not clear how to filter out these dead jobs. Minimally, we should consider having a way of turning the progress bars on and off at runtime without hacking sc._jsc = None to get the thread to quit. That is clearly a stopgap: we should figure out what's causing the bug and get it fixed upstream.
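A minimal sketch of such a runtime switch, assuming the polling loop lives in progress.py (all names hypothetical):

import threading
import time

_progress_enabled = threading.Event()
_progress_enabled.set()

def set_progress_bars(enabled):
    # Public on/off switch, instead of hacking sc._jsc = None.
    if enabled:
        _progress_enabled.set()
    else:
        _progress_enabled.clear()

def _progress_loop(poll_status_tracker):
    while True:
        if _progress_enabled.is_set():
            poll_status_tracker()  # query the StatusTracker and render bars
        time.sleep(1)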
The test_progressbar test in test_spark_docker.py seems to fail randomly on my mac if I run the tests outside of the docker container. Not sure if this is a known issue or not. @mariusvniekerk ?
Due to how we need to make use of the Spark class loader, there are some pieces that are a little more hidden than desired. For example, jvm_helpers.classloader.loadClass("com.MyClass") returns a JavaObject rather than a py4j JavaClass. See if we can make something that makes this a bit easier.
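One possible direction, sketched under the assumption that we have the py4j gateway at hand (helper name hypothetical): resolve the class through the context (Spark) class loader first, then fetch it through the gateway's JVM view, which hands back a proper JavaClass.

from py4j.java_gateway import java_import

def load_class(gateway, fully_qualified_name):
    # Force the context class loader (the Spark one) to resolve the class.
    loader = gateway.jvm.Thread.currentThread().getContextClassLoader()
    loader.loadClass(fully_qualified_name)
    # Then look it up through the JVM view, which yields a py4j JavaClass
    # supporting static member access and construction.
    java_import(gateway.jvm, fully_qualified_name)
    return getattr(gateway.jvm, fully_qualified_name.rsplit(".", 1)[-1])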
Python 3.12 removed the long-deprecated configparser.SafeConfigParser class, which old versions of versioneer depended on.
Expected behavior: pip install spylon-kernel correctly installs the package.
Actual behavior: pip install spylon-kernel fails with this output:
...
File "/tmp/pip-install-7tqi8u7z/spylon_2393384d9e244f009732dc4b8192d5fb/versioneer.py", line 412, in get_config_from_root
parser = configparser.SafeConfigParser()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: module 'configparser' has no attribute 'SafeConfigParser'. Did you mean: 'RawConfigParser'?
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.
This happens when running pip install spylon-kernel using Python versions 3.12 and higher.
SafeConfigParser had been deprecated since Python 3.2, when it was renamed to simply ConfigParser; in Python 3.12 it was finally removed.
The versioneer.py file in this repository was generated with versioneer 0.17, which emitted code that uses SafeConfigParser.
New versions of versioneer use ConfigParser instead, so regenerating versioneer.py with a more recent version fixes the issue.
See: https://docs.python.org/3/whatsnew/3.2.html#configparser
and: https://docs.python.org/3/whatsnew/3.12.html#removed
Proposed fix: regenerate versioneer.py with a more recent version of versioneer.
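The relevant change inside versioneer's get_config_from_root boils down to one line (sketched here; exact line numbers vary by version):

import configparser

# versioneer 0.17 generated this, which breaks on Python 3.12:
#     parser = configparser.SafeConfigParser()
# recent versioneer generates this instead:
parser = configparser.ConfigParser()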
PR #35 removed the docker build approach in favor of a conda environment. Something in the move broke doctr deployment: the deploy key is not found. Probably a path change or something we need to configure.
With this test file, test_leaky_state.py:

import spylon.spark.launcher as sparklauncher
import os

def test_set_spark_property():
    c = sparklauncher.SparkConfiguration()
    c.driver_memory = "4g"

def test_spark_driver_memory():
    c = sparklauncher.SparkConfiguration()
    c.conf.spark.driver.memory = "5g"
    c._set_environment_variables()
    assert '--driver-memory 5g' in os.environ['PYSPARK_SUBMIT_ARGS']
If I run test_spark_driver_memory by itself, it passes just fine:
$ python run_tests.py tests/test_leaky_state.py -k driver -v
========================= test session starts =========================
platform darwin -- Python 3.5.3, pytest-3.0.7, py-1.4.33, pluggy-0.4.0
rootdir: /Users/edill/dev/maxpoint/github/spylon, inifile:
collected 2 items

tests/test_leaky_state.py::test_spark_driver_memory PASSED

========================== 1 tests deselected =========================
================ 1 passed, 1 deselected in 0.29 seconds ===============
Name Stmts Miss Cover
---------------------------------------------------
spylon/__init__.py 4 0 100%
spylon/_version.py 261 143 45%
spylon/common.py 62 40 35%
spylon/spark/__init__.py 4 0 100%
spylon/spark/launcher.py 286 126 56%
spylon/spark/progress.py 76 65 14%
spylon/spark/yarn_launcher.py 144 119 17%
tests/test_leaky_state.py 10 2 80%
---------------------------------------------------
TOTAL 847 495 42%
But if I run the two of these together, notably with test_set_spark_property first, then test_spark_driver_memory fails:
$ python run_tests.py tests/test_leaky_state.py -v
========================= test session starts =========================
platform darwin -- Python 3.5.3, pytest-3.0.7, py-1.4.33, pluggy-0.4.0 -- /Users/edill/miniconda/envs/spylon/bin/python
cachedir: .cache
rootdir: /Users/edill/dev/maxpoint/github/spylon, inifile:
collected 2 items

tests/test_leaky_state.py::test_set_spark_property PASSED
tests/test_leaky_state.py::test_spark_driver_memory FAILED

============================== FAILURES ===============================
_______________________ test_spark_driver_memory ______________________

    def test_spark_driver_memory():
        c = sparklauncher.SparkConfiguration()
        c.conf.spark.driver.memory = "5g"
        c._set_environment_variables()
>       assert '--driver-memory 5g' in os.environ['PYSPARK_SUBMIT_ARGS']
E       AssertionError: assert '--driver-memory 5g' in '--driver-memory 4g pyspark-shell'

tests/test_leaky_state.py:12: AssertionError
================= 1 failed, 1 passed in 0.34 seconds ==================
Name Stmts Miss Cover
---------------------------------------------------
spylon/__init__.py 4 0 100%
spylon/_version.py 261 143 45%
spylon/common.py 62 40 35%
spylon/spark/__init__.py 4 0 100%
spylon/spark/launcher.py 286 124 57%
spylon/spark/progress.py 76 65 14%
spylon/spark/yarn_launcher.py 144 119 17%
tests/test_leaky_state.py 10 0 100%
---------------------------------------------------
TOTAL 847 491 42%
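Based on the failure above, a fresh SparkConfiguration() in the second test still sees the '4g' written by the first test, which points at state shared across instances. A hedged guess at the underlying pattern (class name hypothetical), with the usual one-line fix:

class LeakySparkConfiguration(object):  # hypothetical reconstruction of the bug
    _defaults = {}  # class attribute: shared by ALL instances

    def __init__(self):
        self._conf = self._defaults  # bug: every instance mutates the same dict
        # fix: self._conf = dict(self._defaults) gives each instance its own copy

    def set(self, key, value):
        self._conf[key] = value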
Would be nice if you didn't have to build the docs yourself.