yugabyte-spark-workshop

Hands-on workshop to build apps using Yugabyte Cloud and Spark 3.x. You will build a Spark application that uses the Yugabyte Spark connector to interact with Yugabyte Cloud, and see how the connector supports JSONB data natively. We will use Spark SQL and window functions to work with the data.

Architecture of YB Spark workshop application

  • High Level Tasks:
    • Reading from YugabyteDB YCQL table
    • Performing ETL operation
    • Writing back to YugabyteDB
    • Column pruning and predicate pushdown

Prerequisites

  • Basic understanding of Apache Spark
  • Basic familiarity with YugabyteDB fundamentals - https://docs.yugabyte.com/latest/explore/
  • Familiarity with running Linux commands and bash CLI
  • Basic experience with Scala and Java

Technical Requirements

Agenda

  • Overview of Yugabyte Architecture
  • Yugabyte’s YCQL API
  • YugabyteDB Spark Connector
  • Hands-on Workshop

Session Slides

Hands-on Workshop

  • Check Java version: 1.8 required - java -version

  • Create a free cluster from Yugabyte cloud

    • Install the Yugabyte client shell (replace /Users/weiwang with your own home directory):
        cd /Users/weiwang
        curl -sSL https://downloads.yugabyte.com/get_clients.sh | bash
        export YUGABYTE_HOME=/Users/weiwang/yugabyte-client-2.6
    • Download the cluster certificate (root.crt) to the ~/Downloads directory.
    • Create the truststore, then list its contents. These are two separate keytool commands; run them one at a time:
        keytool -keystore yb-keystore.jks -storetype 'jks' -importcert -file root.crt -keypass 'ybcloud' -storepass 'ybcloud' -alias ~/Documents/spark3yb/root_crt -noprompt
        keytool -list -keystore yb-keystore.jks -storepass ybcloud
    • Connect to the cluster:
        SSL_CERTFILE=/Users/weiwang/Downloads/root.crt $YUGABYTE_HOME/bin/ycqlsh 748fdee2-aabe-4d75-a698-a6514e0b19ff.aws.ybdb.io 9042 -u admin --ssl
    • Create the keyspace and tables and insert the test data using namespace.sql
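
Before pointing Spark at the truststore created above, it can help to confirm that the file opens with the expected password and contains the imported certificate. The following is a minimal sketch using the JDK's java.security.KeyStore API. The object name TruststoreCheck and the default path are illustrative assumptions, not part of the workshop repository; the password ybcloud is the one used in the keytool command above.

```scala
import java.io.FileInputStream
import java.security.KeyStore
import scala.collection.JavaConverters._

// Hypothetical helper, not part of the workshop repository.
object TruststoreCheck {
  // Loads a JKS file and returns the aliases it contains.
  // KeyStore.load throws an IOException if the password is wrong
  // or the file cannot be read.
  def aliases(path: String, password: String): List[String] = {
    val ks = KeyStore.getInstance("JKS")
    val in = new FileInputStream(path)
    try ks.load(in, password.toCharArray)
    finally in.close()
    ks.aliases().asScala.toList
  }

  def main(args: Array[String]): Unit = {
    // Assumed location: the directory where keytool was run.
    val path = args.headOption.getOrElse("yb-keystore.jks")
    aliases(path, "ybcloud").foreach(a => println(s"found alias: $a"))
  }
}
```

If the list comes back empty, the import step did not run, which is the usual symptom of pasting both keytool commands as a single line.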
  • Install Spark 3.0

    • Download Spark 3.0:
        wget https://dlcdn.apache.org/spark/spark-3.0.3/spark-3.0.3-bin-hadoop2.7.tgz
        tar xvf spark-3.0.3-bin-hadoop2.7.tgz
        cd spark-3.0.3-bin-hadoop2.7
    • Invoke the Spark shell with the Yugabyte Spark connector:
        ./bin/spark-shell --conf spark.cassandra.connection.host=127.0.0.1 --conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions --packages com.yugabyte.spark:spark-cassandra-connector_2.12:3.0-yb-8
  • Build an application

  • Import libraries:
      // import libraries
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.Row
      import com.datastax.spark.connector._
      import org.apache.spark.sql.cassandra.CassandraSQLRow
      import org.apache.spark.sql.cassandra._
      import com.datastax.spark.connector.cql.CassandraConnectorConf
      import org.apache.spark.sql.functions._
      import org.apache.spark.sql.expressions.Window
      import com.datastax.spark.connector.cql.CassandraConnector

  • YB Cloud connectivity info (substitute your own cluster host, password, and truststore path):
      val host = "748fdee2-aabe-4d75-a698-a6514e0b19ff.aws.ybdb.io"
      val keyspace = "test"
      val table = "employees_json"
      val user = "admin"
      val password = "your password for admin"
      val keyStore = "/Users/weiwang/Documents/spark3yb/yb-keystore.jks"

  • Create the Spark conf:
      val conf = new SparkConf()
        .setAppName("yb.spark-jsonb")
        .setMaster("local[1]")
        .set("spark.cassandra.connection.localDC", "us-east 2")
        .set("spark.cassandra.connection.host", "127.0.0.1")
        .set("spark.sql.catalog.ybcatalog", "com.datastax.spark.connector.datasource.CassandraCatalog")
        .set("spark.sql.extensions", "com.datastax.spark.connector.CassandraSparkExtensions")

  • Create the Spark session:
      val spark = SparkSession.builder()
        .config(conf)
        .config("spark.cassandra.connection.host", host)
        .config("spark.cassandra.connection.port", "9042")
        .config("spark.cassandra.connection.ssl.clientAuth.enabled", true)
        .config("spark.cassandra.auth.username", user)
        .config("spark.cassandra.auth.password", password)
        .config("spark.cassandra.connection.ssl.enabled", true)
        .config("spark.cassandra.connection.ssl.trustStore.type", "jks")
        .config("spark.cassandra.connection.ssl.trustStore.path", keyStore)
        .config("spark.cassandra.connection.ssl.trustStore.password", "ybcloud")
        .withExtensions(new CassandraSparkExtensions)
        .getOrCreate()
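
The session settings above mix connection details with credentials. One way to keep the password out of the source is to collect the settings in a plain Map and read the secret from the environment. This is only a sketch: the object name YbConnection, the environment variable YB_PASSWORD, and the placeholder host and path are assumptions for illustration, while the configuration keys are the ones used in the session above.

```scala
// Hypothetical helper, not part of the workshop repository.
object YbConnection {
  // Returns the connector settings from the session step as key/value pairs.
  def options(host: String,
              user: String,
              password: String,
              trustStore: String,
              trustStorePassword: String): Map[String, String] = Map(
    "spark.cassandra.connection.host" -> host,
    "spark.cassandra.connection.port" -> "9042",
    "spark.cassandra.auth.username" -> user,
    "spark.cassandra.auth.password" -> password,
    "spark.cassandra.connection.ssl.enabled" -> "true",
    "spark.cassandra.connection.ssl.clientAuth.enabled" -> "true",
    "spark.cassandra.connection.ssl.trustStore.type" -> "jks",
    "spark.cassandra.connection.ssl.trustStore.path" -> trustStore,
    "spark.cassandra.connection.ssl.trustStore.password" -> trustStorePassword
  )

  def main(args: Array[String]): Unit = {
    // YB_PASSWORD is an assumed variable name; export it before running.
    val password = sys.env.getOrElse("YB_PASSWORD", sys.error("set YB_PASSWORD first"))
    val opts = options("your-cluster-host", "admin", password,
      "/path/to/yb-keystore.jks", "ybcloud")
    // Print the keys only, so that secrets never reach the console.
    opts.keys.toSeq.sorted.foreach(println)
  }
}
```

In the Spark shell the map can be folded onto the builder, for example `opts.foldLeft(SparkSession.builder().config(conf)) { case (b, (k, v)) => b.config(k, v) }`, followed by `.withExtensions(new CassandraSparkExtensions).getOrCreate()` as in the step above.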

  • Read from the YCQL table:
      val df_yb = spark.read.table("ybcatalog.test.employees_json")

  • Perform ETL with window functions:
      val windowSpec = Window.partitionBy("department_id").orderBy("salary")
      df_yb.withColumn("row_number", row_number().over(windowSpec)).show(false)
      df_yb.withColumn("rank", rank().over(windowSpec)).show(false)
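
The difference between row_number and rank only shows up when rows tie on the ordering column. The sketch below reproduces the same window specification on a small in-memory DataFrame, so it can be tried without a cluster. The object name WindowSketch and the sample rows are made up for illustration; only the column names mirror the workshop table.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{rank, row_number}

// Hypothetical example, not part of the workshop repository.
object WindowSketch {
  // Adds row_number and rank columns, partitioned by department
  // and ordered by salary, as in the ETL step.
  def ranked(df: DataFrame): DataFrame = {
    val w = Window.partitionBy("department_id").orderBy("salary")
    df.withColumn("row_number", row_number().over(w))
      .withColumn("rank", rank().over(w))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("window-sketch")
      .master("local[1]")
      .getOrCreate()
    import spark.implicits._

    // Made-up rows: two employees in department 10 share a salary.
    val df = Seq((10, 1, 5000), (10, 2, 5000), (10, 3, 7000), (20, 4, 4000))
      .toDF("department_id", "employee_id", "salary")

    ranked(df).orderBy("department_id", "row_number").show(false)
    spark.stop()
  }
}
```

For the two tied rows, rank assigns the same value to both and then skips one, while row_number always counts 1, 2, 3 within the partition. Which of the tied rows receives the lower row_number is not guaranteed.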

  • Write back to a YCQL table:
      df_yb.write.cassandraFormat("employees_json_copy", "test").mode("append").save()
      // to verify
      val sqlDF = spark.sql("SELECT * FROM ybcatalog.test.employees_json_copy order by department_id")
      sqlDF.show(false)

  • Native JSONB support demo

  • Using JSONB column pruning:
      val query = "SELECT department_id, employee_id, get_json_object(phone, '$.code') as code FROM ybcatalog.test.employees_json WHERE get_json_string(phone, '$.key(1)') = '1400' order by department_id limit 2"
      val df_sel1 = spark.sql(query)
      df_sel1.explain

  • Predicate pushdown:
      val query = "SELECT department_id, employee_id, get_json_object(phone, '$.key[1].m[2].b') as key FROM ybcatalog.test.employees_json WHERE get_json_string(phone, '$.key[1].m[2].b') = '400' order by department_id limit 2"
      val df_sel2 = spark.sql(query)
      df_sel2.show(false)
      df_sel2.explain
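
The JSON paths in the two queries above can be hard to read at first. The sketch below evaluates the same paths with Spark's built-in get_json_object on a plain string column, purely to show what each path selects. It runs locally and does not exercise JSONB storage, column pruning, or pushdown; those need the connector and a cluster. The object name JsonPathSketch and the sample document are made up, and get_json_string is left out because it appears to be supplied by the connector rather than by Spark itself.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, get_json_object}

// Hypothetical example, not part of the workshop repository.
object JsonPathSketch {
  // Extracts the two paths used in the workshop queries from a
  // string column named "phone" that holds a JSON document.
  def extract(df: DataFrame): DataFrame =
    df.select(
      get_json_object(col("phone"), "$.code").as("code"),
      get_json_object(col("phone"), "$.key[1].m[2].b").as("key")
    )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("json-path-sketch")
      .master("local[1]")
      .getOrCreate()
    import spark.implicits._

    // Made-up document: key[1] is an object, m[2] is an object, b is a number.
    val df = Seq("""{"code":"+42","key":[0,{"m":[12,-1,{"b":400}]}]}""").toDF("phone")

    extract(df).show(false)
    spark.stop()
  }
}
```

Array indexes in the path are zero-based, and get_json_object returns the selected value as a string, which is why the workshop query compares against '400' rather than 400.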

yugabyte-spark-workshop's Issues

Keytool gives error

When I ran keytool -keystore yb-keystore.jks -storetype 'jks' -importcert -file root.crt -keypass 'ybcloud' -storepass 'ybcloud' -alias ~/Documents/spark3yb/root_crt -noprompt keytool -list -keystore yb-keystore.jks -storepass ybcloud

It gives the error below. Which should we run? import cert or list?

keytool error: java.lang.Exception: Only one command is allowed: both -importcert and -list were specified.
