ronstein2000 / spark-sas7bdat

This project forked from saurfang/spark-sas7bdat

Splittable SAS (.sas7bdat) Input Format for Hadoop and Spark SQL

Home Page: http://spark-packages.org/package/saurfang/spark-sas7bdat

License: Other

Scala 25.67% Java 74.33%

spark-sas7bdat's Introduction

SparkSQL SAS (sas7bdat) Input Library

A library for parsing SAS data (sas7bdat) with Spark SQL. It also includes a SasInputFormat designed for Hadoop MapReduce. The format is splittable when the input is uncompressed, so a large SAS file can be read with high parallelism.
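
To illustrate why an uncompressed file can be split, here is a minimal sketch in plain Scala. It assumes the file consists of a fixed-length header followed by equally sized pages. The names Split and pageAlignedSplits, and the sizes used below, are hypothetical. This is not the library's SasInputFormat implementation, only the general idea of cutting a file at page boundaries so that each task can parse its own byte range independently.

```scala
// Hypothetical sketch: cut a paged file into byte ranges that begin and end on page boundaries.
final case class Split(start: Long, length: Long)

def pageAlignedSplits(
    fileLength: Long,
    headerLength: Long,
    pageLength: Long,
    targetSplitSize: Long): Seq[Split] = {
  require(pageLength > 0 && headerLength >= 0 && fileLength >= headerLength)
  // Each split holds a whole number of pages, and at least one.
  val pagesPerSplit = math.max(1L, targetSplitSize / pageLength)
  // Only complete pages after the header are counted.
  val totalPages = (fileLength - headerLength) / pageLength
  (0L until totalPages by pagesPerSplit).map { firstPage =>
    // The last split may hold fewer pages than the others.
    val pages = math.min(pagesPerSplit, totalPages - firstPage)
    Split(headerLength + firstPage * pageLength, pages * pageLength)
  }
}

// Assumed example sizes: a 1 KiB header, ten 4 KiB pages, three pages per split.
val splits = pageAlignedSplits(1024L + 10 * 4096L, 1024L, 4096L, 3 * 4096L)
splits.foreach(println)
```

Because every range covers whole pages, no record straddles two splits in this simplified model. Compressed input breaks that assumption, which matches the restriction to uncompressed files stated above.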

This library is inspired by spark-csv and currently uses parso for parsing, as it is the only publicly available parser that handles both forms of SAS compression (CHAR and BINARY). Note that parso is licensed under GPL-3, and consequently this library is licensed under GPL-3 as well.

Requirements

This library requires Spark 1.4 or later.

How To Use

This package is published using sbt-spark-package and linking information can be found at http://spark-packages.org/package/saurfang/spark-sas7bdat

Features

This package allows reading SAS files from a local or distributed filesystem as Spark DataFrames.

Schema is automatically inferred from meta information embedded in the SAS file.

Thanks to the splittable SasInputFormat, we are able to convert a 200GB (1.5Bn rows) .sas7bdat file to .csv files using 2000 executors in under 2 minutes.

SQL API

SAS data can be queried in pure SQL by registering the data as a (temporary) table.

CREATE TEMPORARY TABLE cars
USING com.github.saurfang.sas.spark
OPTIONS (path "cars.sas7bdat")

Scala API

The recommended way to load SAS data is using the load functions in SQLContext.

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val df = sqlContext.read.format("com.github.saurfang.sas.spark").load("cars.sas7bdat")
df.select("year", "model").write.format("com.databricks.spark.csv").save("newcars.csv")

You can also use the implicits from import com.github.saurfang.sas.spark._.

import org.apache.spark.sql.SQLContext
import com.github.saurfang.sas.spark._

val sqlContext = new SQLContext(sc)

val cars = sqlContext.sasFile("cars.sas7bdat")

import com.databricks.spark.csv._
cars.select("year", "model").saveAsCsvFile("newcars.csv")

SAS Export Runner

We also included a simple SasExport Spark program that converts .sas7bdat to .csv or .parquet file:

sbt "run input.sas7bdat output.csv"
sbt "run input.sas7bdat output.parquet"

To achieve more parallelism, use the spark-submit script to run it on a Spark cluster. If you don't have a Spark cluster, you can still run it in local mode and take advantage of multiple cores.

For further flexibility, you can use spark-shell:

spark-shell --master local[4] --packages saurfang:spark-sas7bdat:1.1.4-s_2.10

In the shell you can do data analysis like:

import com.github.saurfang.sas.spark._
val random = sqlContext.sasFile("src/test/resources/random.sas7bdat").cache
//random: org.apache.spark.sql.DataFrame = [x: double, f: double]
random.count
//res13: Long = 1000000
random.filter("x > 0.4").count
//res14: Long = 599501

Caveats

  1. spark-csv writes out null as "null" in CSV text output. This means that if you read the file back into a string column, you may get the string "null" instead of null. The safest option is to export in Parquet format, where null is recorded properly. See databricks/spark-csv#147 for an alternative solution.
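
The round-trip problem in this caveat can be reproduced without Spark. The following plain Scala sketch uses two hypothetical helpers, writeField and readField, that only mimic a text writer rendering missing values as "null" and a naive reader; they are not spark-csv functions, and the sample values are made up for illustration.

```scala
// Hypothetical helper: a text writer that renders a missing value as the text "null".
def writeField(value: Option[String]): String = value.getOrElse("null")

// Hypothetical helper: a naive reader that has only the text to go on and keeps it as is.
def readField(text: String): Option[String] = Some(text)

// A present value, a missing value, and a string whose content is literally "null".
val original: Seq[Option[String]] = Seq(Some("Ford"), None, Some("null"))

val roundTripped = original.map(writeField).map(readField)

// After the round trip, the missing value and the literal "null" can no longer be told apart.
println(roundTripped)
```

A format that stores nulls separately from values, such as Parquet, avoids this ambiguity, which is why the caveat recommends it.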
