Giter Site home page Giter Site logo

spoddutur / cloud-based-sql-engine-using-spark Goto Github PK

View Code? Open in Web Editor NEW
32.0 7.0 14.0 189 KB

Cloud-based SQL engine using SPARK where data is accessible as JDBC/ODBC data source via Spark ThriftServer.

Scala 36.68% Java 63.32%
apache-spark thrift-server spark-thrift-server sql-engine sparksql jdbc beeline hadoop-framework

cloud-based-sql-engine-using-spark's Introduction

Spark As CLoudBased SQL Engine

This project shows how to use SPARK as Cloud-based SQL Engine and expose your big-data as a JDBC/ODBC data source via the Spark thrift server.

1. Central Idea

Traditional relational Database engines like SQL had scalability problems and so evolved couple of SQL-on-Hadoop frameworks like Hive, Cloudier Impala, Presto etc. These frameworks are essentially cloud-based solutions and they all come with their own advantages and limitations. This project will demo how SparkSQL comes across as one more SQL-on-Hadoop framework.

2. Architecture

Following picture illustrates how ApacheSpark can be used as SQL-on-Hadoop framework to serve your big-data as a JDBC/ODBC data source via the Spark thrift server.:

  • Data from multiple sources can be pushed into Spark and then exposed as SQLtable
  • These tables are then made accessible as a JDBC/ODBC data source via the Spark thrift server.
  • Multiple clients like Beeline CLI, JDBC, ODBC or BI tools like Tableau connect to Spark thrift server.
  • Once the connection is established, ThriftServer will contact SparkSQL engine to access Hive or Spark temp tables and run the sql queries on ApacheSpark framework.
  • Spark Thrift basically works similar to HiveServer2 thrift where HiveServer2 submits the sql queries as Hive MapReduce job vs Spark thrift server will use Spark SQL engine which underline uses full spark capabilities.

To know more about this topic, please refer to my blog here where I briefed the concept in detail.

3. Structure of the project:

  • data: Contains input json used in MainApp to register sample data with SparkSql.
  • src/main/java/MainApp.scala: Spark 2.1 implementation where it starts SparkSession and registers data from input.json with SparkSQL. (To keep the spark-session alive, there's a continuous while-loop in there).
  • src/test/java/TestThriftClient.java: Java class to demo how to connect to thrift server as JDBC source and query the registered data

4. How to run this project?

This project does demo 2 things:

  • 4.1. How to register data with SparkSql
  • 4.2. How to query registered data via Spark ThriftServer - using Beeline and JDBC

4.1 How to register data with SparkSql

  • Download this project.
  • Build it: mvn clean install and
  • Run MainApp: spark-submit --class MainApp cloud-based-sql-engine-using-spark.jar. Tht's it!
  • It'll register some sample data in records table with SparkSQL.

4.2 How to query registered data via Spark Thrift Server using Beeline and JDBC?

For this, first connect to Spark ThriftServer. Once the connection is established, just like HiveServer2, access Hive or Spark temp tables to run the sql queries on ApacheSpark framework. I'll show 2 ways to do this:

  1. Beeline: Perhaps, the simplest is to use beeline command-line tool provided in Spark's bin folder.
`$> beeline`
Beeline version 2.1.1-amzn-0 by Apache Hive

// Connect to spark thrift server..
`beeline> !connect jdbc:hive2://localhost:10000`
Connecting to jdbc:hive2://localhost:10000
Enter username for jdbc:hive2://localhost:10000:
Enter password for jdbc:hive2://localhost:10000:

// run your sql queries and access data..
`jdbc:hive2://localhost:10000> show tables;,`
  1. Java JDBC: Please refer to this project's test folder where I've shared a java example TestThriftClient.java to demo the same.

5. Requirements

  • Spark 2.1.0, Java 1.8 and Scala 2.11

6. References:

  • Complete guide and references to this project are briefed in my blog here.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.