yonglehou / spark-sklearn

This project forked from databricks/spark-sklearn

Scikit-learn integration package for Spark

License: Apache License 2.0

Shell 3.93% Scala 0.36% Python 32.60% CSS 5.98% JavaScript 30.76% HTML 26.37%

spark-sklearn's Introduction

Scikit-learn integration package for Apache Spark

This package contains tools to integrate the Spark computing framework with the popular scikit-learn machine learning library. Among other things, it lets you:

  • train and evaluate multiple scikit-learn models in parallel. This is a distributed analog of the multicore implementation included by default in scikit-learn.
  • convert Spark's DataFrames seamlessly into NumPy ndarrays or sparse matrices.
  • (experimental) distribute SciPy's sparse matrices as a dataset of sparse vectors.

It focuses on problems that have a small amount of data and that can be run in parallel.

  • for small datasets, it uses Spark to distribute the search for estimator parameters (GridSearchCV in scikit-learn);

  • for datasets that do not fit in memory, we recommend using the distributed implementation in Spark MLlib.

    NOTE: This package distributes simple tasks like grid-search cross-validation. It does not distribute individual learning algorithms (unlike Spark MLlib).
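To make the note above concrete, here is a minimal sketch of why grid search parallelizes well: each parameter combination is an independent task. This is not the package's implementation. It uses a local thread pool from the Python standard library in place of Spark, and `expand_grid`, `parallel_grid_search`, and `toy_score` are hypothetical names introduced only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def expand_grid(param_grid):
    """Yield one dict per combination of parameter values."""
    names = sorted(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        yield dict(zip(names, values))


def parallel_grid_search(score_fn, param_grid, max_workers=4):
    """Score every combination independently and return the best one.

    score_fn is a hypothetical callable: params dict -> float (higher is better).
    """
    candidates = list(expand_grid(param_grid))
    # Each candidate is scored on its own; no task depends on another.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(score_fn, candidates))
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index], scores[best_index]


def toy_score(params):
    # Stand-in for "fit the model with cross-validation and return the mean score".
    kernel_bonus = 0.1 if params["kernel"] == "rbf" else 0.0
    return kernel_bonus - abs(params["C"] - 10) / 100.0


best_params, best_score = parallel_grid_search(
    toy_score, {"kernel": ("linear", "rbf"), "C": [1, 10]}
)
print(best_params, best_score)
```

The learning algorithm inside `score_fn` still runs on a single worker, which is the sense in which only the search is distributed and not the individual model fit.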

Difference with the sparkit-learn project

The sparkit-learn project aims at a comprehensive integration between Spark and scikit-learn. In particular, it adds some primitives to distribute numerical data using Spark, and it reimplements some of the most common algorithms found in scikit-learn.

License

This package is released under the Apache 2.0 license. See the LICENSE file.

Installation

This package is available on PyPI:

pip install spark-sklearn

This project is also available as a Spark package.

The developer version has the following requirements:

  • a recent release of scikit-learn. Release 0.17 has been tested; older versions may work too.
  • Spark >= 2.0. Spark may be downloaded from the Spark official website. To use this package, you need the pyspark interpreter or another Spark-compliant Python interpreter. See the Spark guide for more details. NOTICE: currently, this package uses the nightly 2.0.0 snapshot, available here (TODO: remove reference after 2.0.0 release).
  • nose (testing dependency only)
  • Pandas, if using the Pandas integration or testing. Pandas==0.18 has been tested.

If you want to use a developer version, you just need to make sure the python/ subdirectory is in the PYTHONPATH when launching the pyspark interpreter:

PYTHONPATH=$PYTHONPATH:./python $SPARK_HOME/bin/pyspark
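Once the interpreter is running, a quick check along these lines can confirm that the python/ subdirectory was picked up. This is only a sketch: `is_importable` is a hypothetical helper, not part of the package, and it relies on `importlib.util.find_spec`, which requires Python 3.4 or later.

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module can be located on the current import path."""
    return importlib.util.find_spec(module_name) is not None


if not is_importable("spark_sklearn"):
    print("spark_sklearn not found; check that ./python is on PYTHONPATH")
```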

Running tests

You can run the tests directly:

cd python && ./run-tests.sh

This requires the environment variable SPARK_HOME to point to your local copy of Spark.

Example

Here is a simple example that runs a grid search with Spark. See the Installation section on how to install the package.

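# Run inside the pyspark interpreter, where `sc` is the predefined SparkContext.
from sklearn import svm, datasets
from spark_sklearn import GridSearchCV
iris = datasets.load_iris()
# Candidate values for each SVC hyperparameter.
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
svr = svm.SVC()
# The SparkContext is passed as the first argument.
clf = GridSearchCV(sc, svr, parameters)
clf.fit(iris.data, iris.target)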

This classifier can be used as a drop-in replacement for any scikit-learn classifier, with the same API.

Documentation

More extensive documentation (generated with Sphinx) is available in the python/doc_gen/index.html file.

Changelog

  • 2015-12-10 First public release (0.1)

  • 2016-01-10 Package fix release (0.1.1)

  • 0.1.2:

    • Python 3 support

spark-sklearn's People

Contributors

thunterdb, vlad17, jkbradley, mengxr, zero323, hahnicity
