jupyter-incubator / sparkmagic

Jupyter magics and kernels for working with remote Spark clusters

License: Other

Python 98.03% JavaScript 0.15% Jupyter Notebook 1.82%

spark kernel cluster livy magic sql-query pandas-dataframe jupyter pyspark kerberos

sparkmagic

Sparkmagic is a set of tools for interactively working with remote Spark clusters in Jupyter notebooks. Sparkmagic interacts with remote Spark clusters through a REST server. Currently there are three server implementations compatible with Sparkmagic:

Livy - for running interactive sessions on Yarn
Lighter - for running interactive sessions on Yarn or Kubernetes (only PySpark sessions are supported)
Ilum - for running interactive sessions on Yarn or Kubernetes

The Sparkmagic project includes a set of magics for interactively running Spark code in multiple languages, as well as some kernels that you can use to turn Jupyter into an integrated Spark environment.

Features

Run Spark code in multiple languages against any remote Spark cluster through Livy
Automatic SparkContext (sc) and HiveContext (sqlContext) creation
Easily execute SparkSQL queries with the %%sql magic
Automatic visualization of SQL queries in the PySpark, Spark and SparkR kernels; use an easy visual interface to interactively construct visualizations, no code required
Easy access to Spark application information and logs (%%info magic)
Ability to capture the output of SQL queries as Pandas dataframes to interact with other Python libraries (e.g. matplotlib)
Send local files or dataframes to a remote cluster (e.g. sending pretrained local ML model straight to the Spark cluster)
Authenticate to Livy via Basic Access authentication or via Kerberos

Examples

There are several ways to use sparkmagic. Head over to the examples section for a demonstration of each.

1. Via the IPython kernel

The sparkmagic library provides a %%spark magic that you can use to easily run code against a remote Spark cluster from a normal IPython notebook. See the Spark Magics on IPython sample notebook

2. Via the PySpark and Spark kernels

The sparkmagic library also provides a set of Scala and Python kernels that allow you to automatically connect to a remote Spark cluster, run code and SQL queries, manage your Livy server and Spark job configuration, and generate automatic visualizations. See Pyspark and Spark sample notebooks.

3. Sending local data to Spark Kernel

See the Sending Local Data to Spark notebook.

Installation

Install the library
```
 pip install sparkmagic
```

Make sure that ipywidgets is properly installed by running

 jupyter nbextension enable --py --sys-prefix widgetsnbextension

If you're using JupyterLab, you'll need to run another command:

 jupyter labextension install "@jupyter-widgets/jupyterlab-manager"

(Optional) Install the wrapper kernels. Run pip show sparkmagic to see the path where sparkmagic is installed, cd to that location, and run:

 jupyter-kernelspec install sparkmagic/kernels/sparkkernel
 jupyter-kernelspec install sparkmagic/kernels/pysparkkernel
 jupyter-kernelspec install sparkmagic/kernels/sparkrkernel

(Optional) Modify the configuration file at ~/.sparkmagic/config.json. Look at the example_config.json
(Optional) Enable the server extension so that clusters can be programmatically changed:
```
 jupyter serverextension enable --py sparkmagic
```

Authentication Methods

Sparkmagic supports:

No auth
Basic authentication
Kerberos

The Authenticator is the mechanism for authenticating to Livy. The base Authenticator used by itself supports no auth, but it can be subclassed to enable authentication via other methods. Two such examples are the Basic and Kerberos Authenticators.

Kerberos Authenticator

Kerberos support is implemented via the requests-kerberos package. Sparkmagic expects a kerberos ticket to be available in the system. Requests-kerberos will pick up the kerberos ticket from a cache file. For the ticket to be available, the user needs to have run kinit to create the kerberos ticket.

Kerberos Configuration

By default the HTTPKerberosAuth constructor provided by the requests-kerberos package will use the following configuration

HTTPKerberosAuth(mutual_authentication=REQUIRED)

but this will not be the right configuration for every context, so you can pass custom arguments to this constructor using the following configuration in ~/.sparkmagic/config.json

{
    "kerberos_auth_configuration": {
        "mutual_authentication": 1,
        "service": "HTTP",
        "delegate": false,
        "force_preemptive": false,
        "principal": "principal",
        "hostname_override": "hostname_override",
        "sanitize_mutual_error_response": true,
        "send_cbt": true
    }
}
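As a rough illustration of how such a block could be turned into constructor arguments, the sketch below parses a config fragment and collects the keyword arguments. The helper name `kerberos_kwargs`, the inline config text, and the fallback default are assumptions made for this example, not sparkmagic's actual code. The value 1 for mutual_authentication is assumed to correspond to the REQUIRED constant in requests-kerberos.

```python
import json

# Hypothetical config text; in practice this would be read from ~/.sparkmagic/config.json.
CONFIG_TEXT = """
{
    "kerberos_auth_configuration": {
        "mutual_authentication": 1,
        "service": "HTTP",
        "delegate": false,
        "force_preemptive": false
    }
}
"""

# Assumed fallback when the key is absent (1 is assumed to mean REQUIRED).
DEFAULT_KERBEROS_KWARGS = {"mutual_authentication": 1}


def kerberos_kwargs(config_text):
    """Return keyword arguments intended for HTTPKerberosAuth(**kwargs)."""
    config = json.loads(config_text)
    return config.get("kerberos_auth_configuration", dict(DEFAULT_KERBEROS_KWARGS))


kwargs = kerberos_kwargs(CONFIG_TEXT)
# With requests-kerberos installed, the auth object could then be built as:
#   from requests_kerberos import HTTPKerberosAuth
#   auth = HTTPKerberosAuth(**kwargs)
print(kwargs)
```

The JSON booleans arrive as Python booleans, so the dictionary can be passed to the constructor without further conversion.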

Custom Authenticators

You can write custom Authenticator subclasses to enable authentication via other mechanisms. All Authenticator subclasses should override the Authenticator.__call__(request) method that attaches HTTP Authentication to the given Request object.

Authenticator subclasses that add class attributes used for authentication, such as the Basic authenticator (sparkmagic/sparkmagic/auth/basic.py), which adds username and password attributes, should override the __hash__, __eq__, update_with_widget_values, and get_widgets methods to work with these new attributes. This is necessary for the Authenticator to use these attributes in the authentication process.
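The shape of such a subclass might look like the sketch below. To keep the snippet runnable without sparkmagic installed, it uses a stand-in `Authenticator` base class whose interface is an assumption. `TokenAuthenticator`, its `token` attribute, and the example URL are hypothetical. The widget methods (update_with_widget_values and get_widgets) are left out for brevity.

```python
from types import SimpleNamespace


class Authenticator:
    """Stand-in for sparkmagic's base Authenticator (assumed interface)."""

    def __init__(self, url=""):
        self.url = url

    def __call__(self, request):
        # The base class performs no authentication.
        return request

    def __eq__(self, other):
        return type(self) is type(other) and self.url == other.url

    def __hash__(self):
        return hash((self.url, self.__class__.__name__))


class TokenAuthenticator(Authenticator):
    """Hypothetical subclass that attaches a bearer token to each request."""

    def __init__(self, url="", token=""):
        super().__init__(url)
        self.token = token

    def __call__(self, request):
        request.headers["Authorization"] = "Bearer " + self.token
        return request

    # The new attribute takes part in equality and hashing.
    def __eq__(self, other):
        return super().__eq__(other) and self.token == other.token

    def __hash__(self):
        return hash((self.url, self.token, self.__class__.__name__))


# A SimpleNamespace stands in for the HTTP request object here.
request = SimpleNamespace(headers={})
auth = TokenAuthenticator(url="http://livy.example:8998", token="secret")
auth(request)
print(request.headers["Authorization"])  # prints "Bearer secret"
```

Including the extra attribute in `__eq__` and `__hash__` means two authenticators with different credentials are not treated as the same endpoint configuration.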

Using a Custom Authenticator with Sparkmagic

If your repository layout is:

    .
    ├── LICENSE
    ├── README.md
    ├── customauthenticator
    │   ├── __init__.py 
    │   ├── customauthenticator.py 
    └── setup.py

Then to pip install from this repository, run: pip install git+https://git_repo_url/#egg=customauthenticator

After installing, you need to register the custom authenticator with Sparkmagic so it can be dynamically imported. This can be done in two different ways:

Edit the configuration file at ~/.sparkmagic/config.json with the following settings:
```
{
    "authenticators": {
        "Kerberos": "sparkmagic.auth.kerberos.Kerberos",
        "None": "sparkmagic.auth.customauth.Authenticator",
        "Basic_Access": "sparkmagic.auth.basic.Basic",
        "Custom_Auth": "customauthenticator.customauthenticator.CustomAuthenticator"
  }
}
```
This adds your CustomAuthenticator class in customauthenticator.py to Sparkmagic. Custom_Auth is the authentication type that will be displayed in the %manage_spark widget's Auth type dropdown as well as the Auth type passed as an argument to the -t flag in the %spark add session magic.

Modify the authenticators method in sparkmagic/utils/configuration.py to return your custom authenticator:

def authenticators():
    return {
        u"Kerberos": u"sparkmagic.auth.kerberos.Kerberos",
        u"None": u"sparkmagic.auth.customauth.Authenticator",
        u"Basic_Access": u"sparkmagic.auth.basic.Basic",
        u"Custom_Auth": u"customauthenticator.customauthenticator.CustomAuthenticator"
    }
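To make the "dynamically imported" step concrete, here is a minimal sketch of resolving a dotted path like the ones registered above into a class. The helper name `load_class` is hypothetical and this is not necessarily how sparkmagic does it. A standard-library class stands in for a real authenticator so the snippet runs anywhere.

```python
import importlib


def load_class(dotted_path):
    """Import and return the class named by a 'package.module.ClassName' string."""
    module_path, _, class_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# Illustrative mapping; collections.OrderedDict stands in for an authenticator class.
authenticators = {"Demo_Auth": "collections.OrderedDict"}
cls = load_class(authenticators["Demo_Auth"])
print(cls.__name__)  # prints "OrderedDict"
```

This is why the package containing the custom authenticator must be installed in the same environment as sparkmagic: the import happens by name at runtime.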

Spark config settings

There are two config options for Spark settings: session_configs_defaults and session_configs. session_configs_defaults sets default settings that have to be explicitly overridden in order for a user to change them. session_configs provides defaults that are all replaced whenever a user changes them using the configure magic.
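The difference between the two options can be pictured with the sketch below. It only models the behavior described in the paragraph above and is not sparkmagic's implementation. The function name `effective_session_config` and all the values are assumptions chosen for illustration.

```python
def effective_session_config(session_configs_defaults, session_configs, configured=None):
    """Combine the two config options as described above (illustrative only)."""
    # Whatever the user passes via the configure magic replaces session_configs entirely.
    base = session_configs if configured is None else configured
    # Defaults stay in place unless the same key is explicitly overridden.
    merged = dict(session_configs_defaults)
    merged.update(base)
    return merged


defaults = {"executorMemory": "2G"}                       # hypothetical values
session = {"driverMemory": "1000M", "executorCores": 2}   # hypothetical values

before = effective_session_config(defaults, session)
after = effective_session_config(defaults, session, configured={"executorCores": 4})
print(before)
print(after)
```

In this model, `driverMemory` disappears after the user reconfigures, because it came from session_configs, while `executorMemory` survives because it came from session_configs_defaults.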

HTTP Session Adapters

If you need to customize HTTP request behavior for specific domains by modifying headers, implementing custom logic (e.g., using mTLS, retrying requests), or handling them differently, you can use a custom adapter to gain fine-grained control over request processing.

More details on how we can configure and use http adapter can be found here

For configuring custom http adapter, edit the ~/.sparkmagic/config.json with the following settings:

  "http_session_config": {
    "adapters":
      [
        {
          "prefix": "http://",
          "adapter": "customadapter.customadapter.CustomAdapter"
        }
      ]
  },

This adds your CustomAdapter class in customadapter.py to the HTTP session that sparkmagic uses for Livy requests.

Papermill

If you want Papermill rendering to stop on a Spark error, edit the ~/.sparkmagic/config.json with the following settings:

{
    "shutdown_session_on_spark_statement_errors": true,
    "all_errors_are_fatal": true
}

If you want any registered livy sessions to be cleaned up on exit regardless of whether the process exits gracefully or not, you can set:

{
    "cleanup_all_sessions_on_exit": true,
    "all_errors_are_fatal": true
}

Conf overrides in code

In addition to the conf at ~/.sparkmagic/config.json, sparkmagic conf can be overridden programmatically in a notebook.

For example:

import sparkmagic.utils.configuration as conf
conf.override('cleanup_all_sessions_on_exit', True)

Same thing, but referencing the conf member:

conf.override(conf.cleanup_all_sessions_on_exit.__name__, True)

NOTE: the override for cleanup_all_sessions_on_exit must be set before initializing sparkmagic, i.e. before this:

%load_ext sparkmagic.magics

Docker

The included docker-compose.yml file will let you spin up a full sparkmagic stack that includes a Jupyter notebook with the appropriate extensions installed, and a Livy server backed by a local-mode Spark instance. (This is just for testing and developing sparkmagic itself; in reality, sparkmagic is not very useful if your Spark instance is on the same machine!)

In order to use it, make sure you have Docker and Docker Compose both installed, and then simply run:

docker compose build
docker compose up

You will then be able to access the Jupyter notebook in your browser at http://localhost:8888. Inside this notebook, you can configure a sparkmagic endpoint at http://spark:8998. This endpoint is able to launch both Scala and Python sessions. You can also choose to start a wrapper kernel for Scala, Python, or R from the list of kernels.

To shut down the containers, you can interrupt docker compose with Ctrl-C, and optionally remove the containers with docker compose down.

If you are developing sparkmagic and want to test out your changes in the Docker container without needing to push a version to PyPI, you can set the dev_mode build arg in docker-compose.yml to true, and then re-build the container. This will cause the container to install your local version of autovizwidget, hdijupyterutils, and sparkmagic. The local packages are installed with the editable flag, meaning you can make edits directly to the libraries within the Jupyterlab docker service to debug issues in realtime. To make local changes available in Jupyterlab, make sure to re-run docker compose build before spinning up the services.

Server extension API

`/reconnectsparkmagic`:

POST: Lets you specify Spark cluster connection information for a notebook by passing in the notebook path and cluster information. The kernel will be started or restarted and connected to the specified cluster.

Request Body example: { 'path': 'path.ipynb', 'username': 'username', 'password': 'password', 'endpoint': 'url', 'auth': 'Kerberos', 'kernelname': 'pysparkkernel' }

Note that the auth can be either None, Basic_Access or Kerberos, based on the authentication enabled in Livy. The kernelname parameter is optional and defaults to the one specified in the config file, or to pysparkkernel if the config file does not specify one. Returns 200 if successful; 400 if the body is not a JSON string or a key is not found; 500 if an error is encountered while changing clusters.

Reply Body example: { 'success': true, 'error': null }
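A request to this endpoint could be assembled as in the sketch below, using only the standard library. The server address is an assumption, the field values are the placeholders from the example above, and a real Jupyter server will usually also expect an authentication token, which is not shown. The actual send is commented out so the snippet runs without a server.

```python
import json
import urllib.request

# Assumed Jupyter server address; adjust to your deployment.
JUPYTER_URL = "http://localhost:8888/reconnectsparkmagic"

payload = {
    "path": "path.ipynb",
    "username": "username",
    "password": "password",
    "endpoint": "url",
    "auth": "Kerberos",
    "kernelname": "pysparkkernel",  # optional
}

request = urllib.request.Request(
    JUPYTER_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Sending is left commented out so the snippet runs without a server:
# with urllib.request.urlopen(request) as response:
#     reply = json.loads(response.read())
#     print(reply["success"], reply["error"])
print(request.get_method(), request.full_url)
```

Serializing with `json.dumps` guarantees a body that parses as JSON, which matters because the endpoint answers 400 otherwise.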

Architecture

Sparkmagic uses Livy, a REST server for Spark, to remotely execute all user code. The library then automatically collects the output of your code as plain text or a JSON document, displaying the results to you as formatted text or as a Pandas dataframe as appropriate.

This architecture offers us some important advantages:

Run Spark code completely remotely; no Spark components need to be installed on the Jupyter server
Multi-language support; the Python, Python3, Scala and R kernels are equally feature-rich, and adding support for more languages will be easy
Support for multiple endpoints; you can use a single notebook to start multiple Spark jobs in different languages and against different remote clusters
Easy integration with any Python library for data science or visualization, like Pandas or Plotly

However, there are some important limitations to note:

Some overhead added by sending all code and output through Livy
Since all code is run on a remote driver through Livy, all structured data must be serialized to JSON and parsed by the Sparkmagic library so that it can be manipulated and visualized on the client side. In practice this means that you must use Python for client-side data manipulation in %%local mode.

Contributing

We welcome contributions from everyone. If you've made an improvement to our code, please send us a pull request.

To dev install, execute the following:

Clone the repo

git clone https://github.com/jupyter-incubator/sparkmagic

Install local versions of packages

pip install -e hdijupyterutils 
pip install -e autovizwidget
pip install -e sparkmagic

Alternatively, you can use Poetry to setup a virtual environment

poetry install
# If you run into issues install numpy or pandas, run
# poetry run pip install numpy pandas
# then re-run poetry install

Run unit tests, with pytest

# if you don't have pytest and mock installed, run
# pip install pytest mock
pytest

If you installed packages with Poetry, run

poetry run pytest

If you want to see an enhancement made but don't have time to work on it yourself, feel free to submit an issue for us to deal with.

sparkmagic's Issues

Syntax highlighting

For Scala and R support, we may want to look at http://pygments.org/languages/

For SQL support, look at @alope107's solution in alope107/spark-sql-magic@34b4d86

Shutting down the pyspark kernel doesn't guarantee that the session has been deleted

Repro Steps:

Open Pyspark kernel
Do any operation, 1+1 for example
Shut it down before getting the answer (while the SQL context and the Hive context are being created)
SSH into the cluster; you'll find the session still exists

For now, to delete it, you have to do so manually after SSHing into the cluster.

Does Azure HDInsights have livy and sparkmagic working in the jupyter notebook?

Hi there,

I'm looking for a solution that works with jupyter notebook via livy to use Spark. It seems that sparkmagic is a good fit for that. I wonder if Azure HDInsights has this service built in?

Best Regards,

Jerry

`ipywidgets.FlexBox` deprecated in version 5.0

I think the current version of ipywidgets is 4.1.1 so this is not a pressing issue (it's not deprecated yet) but for future-proofing we should consider moving the autovizwidget away from that model.

Give pandas df back to user when user runs sql query

The pandas df being constructed is not being passed back to the user for the user to play with.

So, user does something like:

%spark -c sql SELECT * FROM table

and even though the result is being constructed into a pandas df, user cannot visualize it.

Instead, a user could do something like:

%spark -c sql -v myDf SELECT * FROM table

and result could be available in myDf.

We should discuss the syntax before implementing this. Pinging @ellisonbg

Integrate auto viz magic with wrapper kernel

Print progress while session is being created

Give the user some kind of feedback so that user knows that code is not frozen.

SQL queries are not escaped properly

From livyclient.py:

def execute_sql(self, command):
    return self.execute('sqlContext.sql("{}").collect()'.format(command))

def execute_hive(self, command):
    return self.execute('hiveContext.sql("{}").collect()'.format(command))

If the SQL query the user passes in has double-quotes in it (double-quotes are a valid string delimiter in Spark SQL), then this is liable to cause an error. Moreover, if the query happens to have a stray " in it (maybe as a result of user error), then this will cause a syntax error in the rest of the running code.
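One possible direction for Python sessions, sketched here under assumptions rather than taken from the project, is to let `repr()` produce the string literal instead of wrapping the query in double quotes by hand. The helper name `build_sql_statement` and the sample query are hypothetical. Scala sessions would need their own escaping.

```python
import ast


def build_sql_statement(command, context_name="sqlContext"):
    """Embed a SQL query in generated Python code as a properly escaped literal."""
    # repr() produces a valid Python string literal, escaping quotes and backslashes.
    return "{}.sql({}).collect()".format(context_name, repr(command))


query = 'SELECT * FROM t WHERE name = "O\'Brien" AND path = "C:\\tmp"'
statement = build_sql_statement(query)

# Parse the generated code and recover the embedded literal to confirm it round-trips.
call = ast.parse(statement, mode="eval").body
embedded = ast.literal_eval(call.func.value.args[0])
print(embedded == query)  # prints True
```

Because the generated statement parses cleanly, a stray quote in the query can no longer break the surrounding code; any remaining error would come from Spark SQL itself.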

Allow user to return dataframes not constructed through SQL queries

A thought I just had: Currently we only return responses as dataframes if their query is a SQL query. Since we already have the code for parsing dataframes from JSON responses, we might provide an option that lets users say "I'm going to output a bunch of JSON here, please parse it and return it as a dataframe", even if their code isn't a SQL query. This could be useful and could allow users to get arbitrary semi-structured data back from the remote cluster. I'm not sure how feasible this is, or if this would be too error-prone, but I think it makes sense and could be really useful in some niche scenarios.

Revise API

Consolidate magics and commands. Clean up UX.

Error when visualizing empty dataframe

Steps to reproduce:

%sql SHOW TABLES or %hive SHOW TABLES when there are no tables (i.e. the result dataframe is empty).
The data viz widget pops up. Switch from "table" to any of the other chart styles.
You get an exception. This exception doesn't go away even if you switch back to the table graph type.
```
ValueError: cannot label index with a null key
```

reliablehttpclient retry on retriable status codes & add unit tests for it

Does not retry now. Look at TODO
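A retry loop of the kind requested could look roughly like the sketch below. The set of retriable status codes, the function name `send_with_retries`, and the injectable `sleep` parameter are assumptions for illustration. The transport is faked so the behavior can be unit tested without a network.

```python
import time

RETRIABLE_STATUS_CODES = {500, 502, 503, 504}  # assumed set, for illustration


def send_with_retries(send, max_attempts=3, delay_seconds=0.0, sleep=time.sleep):
    """Call send() until it returns a non-retriable status or attempts run out."""
    status = None
    for attempt in range(1, max_attempts + 1):
        status = send()
        if status not in RETRIABLE_STATUS_CODES:
            return status
        if attempt < max_attempts:
            # Wait before the next attempt, but not after the last one.
            sleep(delay_seconds)
    return status


# Fake transport that fails twice before succeeding.
responses = iter([503, 502, 200])
result = send_with_retries(lambda: next(responses))
print(result)  # prints 200
```

Passing `send` and `sleep` as parameters keeps the loop independent of any HTTP library, which is what makes it straightforward to cover with unit tests.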

Change Magic Contract

Change the way that the magic is used so that it is run once to specify the livy connection and all subsequent cells are run against the remote cluster. This clears up the confusion between the local and remote namespaces without being as heavyweight of a solution as a new kernel.

Expose livy endpoint management through wrapper kernel

Rename --endpoint param to magics to --session

Make -e be -s

Manage livy endpoint from magics

This will be the API:

%spark add session_name language conn_string
will create a session against the endpoint specified
%spark info
will display the info for the sessions created in that notebook
%spark config <configuration_overrides>
will add session configs for subsequent sessions
%spark info conn_string
will list the sessions for a given livy endpoint by providing session_id, language, state
%spark delete session_name
will delete a session by its name from the notebook that created it
%spark delete conn_string session_id
will delete a session for a given endpoint by its id
%spark cleanup
will delete all sessions created by the notebook
%spark cleanup conn_string
will delete all session for the given livy endpoint

This covers #56, #75, and #76 for magics in Python kernel.
We are not designing the API for the wrapper kernels here and we'll tackle that as a separate improvement.

ping @msftristew @ellisonbg to take a look when they can

Restructure repos

We should create other repos for a Livy client, a configuration getter, a logger, and so on. Then we should put our Python files into folders that make actual sense.

In doing this, we might want to add docstring comments for public methods in every repo.

Improve docstring showed for the %spark magic

When a user runs %spark? in the notebook, the docstring should include a small paragraph stating that the magic allows you to connect to a Livy endpoint by creating sessions that are tied to a particular language, and that every session can run code in that language + SparkSQL.

Session properties are not back propagated to configuration class when set

Improve parsing for dataframe generation and gracefully handle errors

Improve dataframe parsing. In particular, the pyspark livy client should not be calling "eval". This may require further investigation into the structure of the strings that Livy may possibly return to the client.
Tighten up error handling. This requires enumerating all the possible errors that we may run into during parsing, and converting them smartly into DataFrameParseExceptions so that error messages can be displayed to the user nicely.

Expose session configs through wrapper kernel

Installation is hard

See title; installation is a few steps, which should ideally all be subsumed by a Pip install.

Allow user to specify memory/cores/etc for every session

Improve initial documentation

Travis support to have tests run

Enable Travis support

Serialize client manager state to disk so that restarted kernel can read state and configure itself automatically

wait for state doesn't return immediately if state is final

When state goes into a final state, like error, wait for state should immediately return.

Explore alternate SQL contexts

Sparkmagic currently supports only vanilla SQLContexts as first class interfaces. If a user wants to use an alternate context (like a HiveQLContext), they can do so through the pyspark or scala interfaces, but they must handle the context itself. It may be useful to allow the user to specify a type of SQLContext when using the SQL interface.

Create pandas dataframes from Livy JSON responses for Scala SQL

Expose %delete session_number through wrapper kernels

So that user can delete individual sessions from a wrapper kernel. To be used with #75

Refactor by extracting client manager

Allow user to specify how many rows/what method to use when doing a sql query

A user should be able to specify how many rows to get back on a case-by-case basis.

Also, when getting results back, the user might sometimes want to do a take, sometimes a sample, and sometimes something else, again on a case-by-case basis.

Wait for state timeout should subtract time based on elapsed clock time, not on parameter
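A sketch of what this could look like, combined with returning immediately on final states, is shown below. It is not the project's code: the function signature, the set of final state names, and the injectable `clock` and `sleep` parameters are assumptions for illustration. The deadline is computed once from a monotonic clock, so slow polls cannot stretch the timeout.

```python
import time

FINAL_STATES = {"dead", "error", "success"}  # assumed names, for illustration


def wait_for_state(get_state, wanted, timeout_seconds, poll_seconds=0.01,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll get_state() until wanted is reached, a final state appears, or time runs out."""
    deadline = clock() + timeout_seconds
    while True:
        state = get_state()
        if state == wanted:
            return state
        if state in FINAL_STATES:
            # A final state will never change, so stop waiting right away.
            raise RuntimeError("session reached final state '{}'".format(state))
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError("timed out waiting for state '{}'".format(wanted))
        sleep(min(poll_seconds, remaining))


states = iter(["starting", "starting", "idle"])
print(wait_for_state(lambda: next(states), "idle", timeout_seconds=1.0))  # prints "idle"
```

Measuring the remaining time against the clock, rather than decrementing a counter by the poll interval, accounts for the time spent inside `get_state()` itself.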

Missing fields produce Altair errors

I was doing the following hive query when I discovered an error:

%hive SELECT * FROM hivesampletable WHERE deviceplatform = 'Android'

With the following pandas df:

records_text = '{"clientid":"8","querytime":"18:54:20","market":"en-US","deviceplatform":"Android","devicemake":"Samsung","devicemodel":"SCH-i500","state":"California","country":"United States","querydwelltime":13.9204007,"sessionid":0,"sessionpagevieworder":0}\n{"clientid":"23","querytime":"19:19:44","market":"en-US","deviceplatform":"Android","devicemake":"HTC","devicemodel":"Incredible","state":"Pennsylvania","country":"United States","sessionid":0,"sessionpagevieworder":0}'
json_array = "[{}]".format(",".join(records_text.split("\n")))
import json
import pandas as pd  # needed for pd.DataFrame below
d = json.loads(json_array)
result = pd.DataFrame(d)
result

the NaN for querydwelltime produces the following error:

Javascript error adding output!
TypeError: Cannot read property 'prop' of undefined
See your browser Javascript console for more details.

The vegalite spec produced by Altair is:

{'config': {'width': 600, 'gridOpacity': 0.08, 'gridColor': u'black', 'height': 400}, 'marktype': 'point', 'data': {'formatType': 'json', 'values': [{u'deviceplatform': u'Android', u'devicemodel': u'SCH-i500', u'country': u'United States', u'sessionpagevieworder': 0, u'state': u'California', u'clientid': u'8', u'sessionid': 0, u'querytime': u'18:54:20', u'devicemake': u'Samsung', u'market': u'en-US', u'querydwelltime': 13.9204007}, {u'deviceplatform': u'Android', u'devicemodel': u'Incredible', u'country': u'United States', u'sessionpagevieworder': 0, u'state': u'Pennsylvania', u'clientid': u'23', u'sessionid': 0, u'querytime': u'19:19:44', u'devicemake': u'HTC', u'market': u'en-US', u'querydwelltime': nan}]}}

For commentary and possible fixes, this issue is tracked by lightning renderer in:
lightning-viz/lightning-python#34

We might need to address this in Altair too.

cc @ellisonbg @mathisonian

Make pandas dtypes correct

This includes running nunique on the string columns and then making them categorical values instead of just strings.

Create pandas dataframes from Livy JSON responses for PySpark SQL

Create widget for session manager

This prevents users from typing their credentials in clear text.

Users should not be allowed to manage sessions by using text subcommands if they are running the notebook in the browser. It should be allowed for users in a terminal, though.

Improve error when credentials are not provided

Currently, when you fail to specify credentials in the config file, the kernel crashes. The exception (that you need to provide credentials for Livy) is visible in Jupyter's logs but it would be ideal if the kernel didn't crash and instead the error message was written to the screen.

Should the default Livy URL be localhost:8998?

See title. I wonder if this change would increase the odds that someone can do a clean install of the wrapper kernels and have everything "just work" without having to mess with configurations at all.

use_auto_viz is false by default?

Is there a reason for this?

Incorrect visualizations on some sample data

Ran the following code:

hvac = sc.textFile('wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv')
from pyspark.sql import Row
Doc = Row("TargetTemp", "ActualTemp", "System", "SystemAge", "BuildingID")
def parseDocument(line):
    values = [str(x) for x in line.split(',')]
    return Doc(values[2], values[3], values[4], values[5], values[6])
documents = hvac.filter(lambda s: "Date" not in s).map(parseDocument)
df = sqlContext.createDataFrame(documents)
df.registerTempTable('data')

and then

%select * from data limit 100

The visualizations, at least for the pie graphs, are wrong. Screenshot:

Clearly there is no building where the desired target temperature is 1.

Return well-formatted string/error from Livy to user

Right now, magics return the result from Livy without being aware of whether the string is a result or an error.

In the case of a result, magics should nicely print the result back.
In the case of an error, it should be clear to the user that an error just happened in the cluster. A stacktrace should be printed if available.

User is not notified if context creation fails

I just ran into this when a HiveContext failed to create in Scala and I wasn't notified that any error happened. When I then tried to run %hive SHOW TABLES, an error was thrown that hiveContext wasn't defined.

LivySession: implement timeout for wait_for_state

Look at TODO

Investigate perf issues with auto_viz

Can all graph types handle ~2500 result rows?

Kill all sessions for a given Livy endpoint

When a user adds a new session, the user might find out that leaked/unused Livy sessions are taking resources up and might want to kill some of them.

How can we remove the Lightning initialization output from showing up?

@ellisonbg you might have some ideas.

%hive show tables doesn't seem to work

This looks like a regression when the improved output rendering change was introduced. %hive SHOW TABLES crashes explaining it doesn't know how to convert the output into a dataframe (the output is an empty list). It definitely doesn't work when the list of tables is empty; it also probably does not work if that list is nonempty.

Add %info to wrapper kernels

This should display to the user:

Livy endpoint kernel will hit
Sessions for the given endpoint (number, state, and type)

Rethink configuration

This is part of a greater architectural issue around configurations. There should be a configuration module which:

Automatically loads data from a configuration file,
loads the configuration with an appropriate set of defaults if things are missing from the configuration file, and
allows a developer to substitute a different configuration module if necessary (i.e. for tests).
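The three requirements above can be sketched as a small class, shown here only as one possible shape for such a module. The class name `Configuration`, the default keys and values, and the temporary file used for the demonstration are assumptions for illustration.

```python
import json
import os
import tempfile

# Assumed defaults, for illustration only.
DEFAULTS = {"use_auto_viz": True, "livy_url": "http://localhost:8998"}


class Configuration:
    """Loads settings from a JSON file and falls back to defaults for missing keys."""

    def __init__(self, path=None, defaults=None):
        # Tests can substitute their own defaults instead of the module-level ones.
        self._values = dict(DEFAULTS if defaults is None else defaults)
        if path is not None and os.path.exists(path):
            with open(path) as handle:
                self._values.update(json.load(handle))

    def get(self, key):
        return self._values[key]


# Write a partial config file, then load it; the missing key falls back to its default.
with tempfile.TemporaryDirectory() as folder:
    config_path = os.path.join(folder, "config.json")
    with open(config_path, "w") as handle:
        json.dump({"use_auto_viz": False}, handle)
    config = Configuration(config_path)

print(config.get("use_auto_viz"), config.get("livy_url"))
```

Code that needs settings would receive a `Configuration` instance rather than reading the file itself, so a test can pass in an instance built from an in-memory defaults dictionary.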