Giter Site home page Giter Site logo

spark-data-profiler's Introduction

spark-data-profiler

Profiles data stored in a file system or any other datasource.

Profile

Data profiling works similar to df.describe(), but acts on non-numeric columns.

To use profile execute the implicit method profile on a DataFrame. An example follows.

import com.kajjoy.spark.data.profiler.profile.DataFrameUtils._

val df = Seq(

      ("Mets","1986", "New York", "27896702"),
      ("Yankees", "2009", "New York", "3063405"),
      ("Cubs", "2016", "Chicago", "3232420"),
      ("White Sox","2005","Chicago", "1746293"),
      ("Nationals","","Washington", "2481938"),
      ("Senators","1924","Washington",null)
    ).toDF("Team", "Last Championship", "City", "2016 Attendance")

df.profile.show

will produce results like this:

+-----------------+------------+-------------+-------------+-----------+------------+---------------+----------+
|      Column Name|Record Count|Unique Values|Empty Strings|Null Values|Percent Fill|Percent Numeric|Max Length|
+-----------------+------------+-------------+-------------+-----------+------------+---------------+----------+
|             Team|           6|            6|            0|          0|       100.0|            0.0|         9|
|Last Championship|           6|            6|            1|          0|        83.3|           83.3|         4|
|             City|           6|            3|            0|          0|       100.0|            0.0|         8|
|  2016 Attendance|           6|            6|            0|          1|        83.3|           83.3|         8|
+-----------------+------------+-------------+-------------+-----------+------------+---------------+----------+

Results Explained

The profile command is similar to the Describe except that it gives us statistics on non-numeric columns. This allows us to get a quick overview of the contents of a DataFrame before getting into in depth analysis.

We provide the following metrics on each column of the input DataFrame

  • Record Count: The total number of records in the column
  • Unique Values: The distinct count of unique values in a column
  • Empty Strings: The count of empty strings in a column
  • Null Values: The count of null values in a column
  • Percent Fill: The percentage of records that are not empty string or null
  • Percent Numeric: The percentage of records that are numeric
  • Max Len: The max length of data

A Note about Performance

Limit the columns to only those that are required as these calculations are processor intensive. Several queries are executed for each column.

In other words before running profile on a DataFrame it is advisable to select only the required colums:

someDataFrame.select("column1","column2").profile

will perform better than

someDataFrame.profile

Another Note about Performance

For larger data frames run it on a sample

someDataFrame.sample(true, .01).select("col1").profile

spark-data-profiler's People

Contributors

jasonsatran avatar mrpowers avatar

Watchers

James Cloos avatar Dipayan avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.