spark-ecs-connector

Bucket Metadata Search with Spark SQL (2.x)

The spark-ecs-connector project makes it possible to view an ECS bucket as a Spark dataframe. Each row in the dataframe corresponds to an object in the bucket, and each column corresponds to a piece of object metadata.

How it Works

Spark SQL supports querying external data sources and rendering the results as a dataframe. With the PrunedFilteredScan trait, the external data source handles column pruning and predicate pushdown. In other words, the WHERE clause is pushed to ECS by taking advantage of the bucket metadata search feature of ECS 2.2.
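To make the pushdown idea concrete, here is a minimal sketch of how the filters Spark hands to a PrunedFilteredScan could be turned into a metadata search expression. It is not the connector's actual code. The filter classes are local stand-ins that mirror the names in org.apache.spark.sql.sources so the example runs without Spark, toCondition and toQuery are hypothetical helpers, and the rendered query syntax (conditions joined by " and ") is an assumption about the ECS search format.

```scala
object PushdownSketch {
  // Stand-ins for a few filter case classes from org.apache.spark.sql.sources,
  // redefined locally so the sketch runs without Spark on the classpath.
  sealed trait Filter
  final case class GreaterThan(attribute: String, value: Any) extends Filter
  final case class GreaterThanOrEqual(attribute: String, value: Any) extends Filter
  final case class LessThan(attribute: String, value: Any) extends Filter
  final case class LessThanOrEqual(attribute: String, value: Any) extends Filter
  final case class Or(left: Filter, right: Filter) extends Filter

  // Hypothetical helper: render one filter as a search condition,
  // or None when this sketch does not push it down.
  def toCondition(filter: Filter): Option[String] = filter match {
    case GreaterThan(a, v)        => Some(s"${a}>${v}")
    case GreaterThanOrEqual(a, v) => Some(s"${a}>=${v}")
    case LessThan(a, v)           => Some(s"${a}<${v}")
    case LessThanOrEqual(a, v)    => Some(s"${a}<=${v}")
    case _: Or                    => None // OR is left for Spark to evaluate
  }

  // Spark passes top-level filters as a list with an implicit AND between them.
  // Returns the query to send (if any) and the filters Spark must still apply.
  def toQuery(filters: Seq[Filter]): (Option[String], Seq[Filter]) = {
    val (pushable, residual) = filters.partition(f => toCondition(f).isDefined)
    val conditions = pushable.flatMap(f => toCondition(f))
    val query = if (conditions.isEmpty) None else Some(conditions.mkString(" and "))
    (query, residual)
  }
}
```

Under these assumptions, a WHERE clause bounding image-viewcount between 5000 and 10000 arrives as two filters and is rendered as the single expression image-viewcount>=5000 and image-viewcount<=10000. Any filter the sketch cannot translate is returned as a residual for Spark to evaluate after the scan.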

Using

Linking to your Spark 2.x Application

The library is published to Maven Central. Link to the library using these dependency coordinates:

com.emc.ecs:spark-ecs-connector_2.11:1.4.2
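
For an sbt build, the same coordinates could be declared as shown here. This is a sketch that assumes an sbt project. Because the artifact name already carries the _2.11 Scala suffix, it uses a single % rather than %%.

```scala
// build.sbt: add the connector using the coordinates listed above
libraryDependencies += "com.emc.ecs" % "spark-ecs-connector_2.11" % "1.4.2"
```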

Using in Zeppelin

  1. Install Zeppelin 0.7+.
  2. Set the Spark local IP: export SPARK_LOCAL_IP=127.0.0.1
  3. Start Zeppelin: bin/zeppelin.sh

Create a notebook with the following commands. Replace the *** placeholders with your S3 credentials, and adjust the endpoint URI and bucket name to match your environment.

%dep
z.load("com.emc.ecs:spark-ecs-connector_2.11:1.4.2")

import java.net.URI
import com.emc.ecs.spark.sql.sources.s3._

val endpointUri = new URI("http://10.1.83.51:9020/")
val credential = ("***ACCESS KEY ID***", "***SECRET ACCESS KEY***")

val df = sqlContext.read.bucket(endpointUri, credential, "ben_bucket", withSystemMetadata = false)
df.createOrReplaceTempView("ben_bucket")

%sql
SELECT * FROM ben_bucket
WHERE `image-viewcount` >= 5000 AND `image-viewcount` <= 10000

Contributing

Building

The project uses the Gradle build system and includes a wrapper script that automatically downloads Gradle.

Build and install the library to your local Maven repository as follows:

$ ./gradlew publishShadowPublicationToMavenLocal

TODO

  1. Implement 'OR' pushdown. ECS supports 'or', but not in combination with 'and'.
  2. Avoid sending a query containing a non-indexable key.

Contributors

eronwright, meetgrinder, twincitiesguy
