
This project forked from nike-inc/spark-expectations


A Python library to support running data quality rules while the Spark job is running ⚡

Home Page: https://engineering.nike.com/spark-expectations

License: Apache License 2.0



Spark-Expectations


Spark Expectations is a specialized tool designed with the primary goal of maintaining data integrity within your processing pipeline. By identifying and preventing malformed or incorrect data from reaching the target destination, it ensures that only quality data is passed through. Erroneous records are not simply ignored; they are filtered into a separate error table, allowing for detailed analysis and reporting. Additionally, Spark Expectations provides statistics on the filtered content, giving you insight into your data quality.


The documentation for spark-expectations can be found at https://engineering.nike.com/spark-expectations

Contributors

Thanks to all the contributors who have helped ideate, develop, and bring it to its current state.

Contributing

We're delighted that you're interested in contributing to our project! To get started, please carefully read and follow the guidelines provided in our contributing document.

What is Spark Expectations?

Spark Expectations is a data quality framework built in PySpark as a solution to the following problem statements:

  1. Existing data quality tools validate the data in a table at rest and provide success and error metrics. Users need to check the metrics manually to identify the error records.
  2. The error data is not quarantined to an error table, and no corrective action is taken to send only the valid data downstream.
  3. Users further downstream must either consume the same incorrect data or perform additional computation to eliminate records that don't comply with the data quality rules.
  4. A separate process is required as a corrective action to rectify the errors in the data, and a lot of planning is usually needed for this activity.

Spark Expectations solves these issues using the following principles:

  1. All records that fail one or more data quality rules are, by default, quarantined in an _error table along with metadata on the rules that failed, job information, etc. (see the sketch after this list). This makes it easier for analysts or product teams to view the incorrect data and collaborate with the teams responsible for correcting and reprocessing it.
  2. Aggregated metrics are provided for the raw data and the cleansed data for each run, along with the required metadata, to avoid recomputation.
  3. Data that doesn't meet the data quality contract or standards is not moved to the next level or iteration unless otherwise specified.
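As an illustration of the first principle, here is a minimal sketch of inspecting the quarantined records and run statistics after a job completes. It assumes an active Spark session and reuses the stats table name from the example later in this README, plus an _error-suffixed table next to the target table for the quarantined records; the exact table names depend on your setup and are assumptions for this sketch, not part of the library's API.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Quarantined records, with metadata on the rules that failed
# (assumed to live in an "_error"-suffixed table next to the target table)
spark.table("pilot_nonpub.dq_employee.employee_error").show(truncate=False)

# Aggregated run-level statistics for the raw and cleansed data
spark.table("pilot_nonpub.dq.dq_stats").show(truncate=False)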

Features Of Spark Expectations

Please find the spark-expectations flow and feature diagrams below

Spark - Expectations Setup

Configurations

To establish the global configuration for DQ Spark Expectations, define a variable and populate all of the required fields, as shown below.

from spark_expectations.config.user_config import *
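# NOTE: the unquoted dictionary keys below are constants star-imported from user_config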

se_global_spark_Conf = {
    se_notifications_enable_email: False,
    se_notifications_email_smtp_host: "mailhost.nike.com",
    se_notifications_email_smtp_port: 25,
    se_notifications_email_from: "<sender_email_id>",
    se_notifications_email_to_other_nike_mail_id: "<receiver_email_id's>",
    se_notifications_email_subject: "spark expectations - data quality - notifications", 
    se_notifications_enable_slack: True,
    se_notifications_slack_webhook_url: "<slack-webhook-url>", 
    se_notifications_on_start: True, 
    se_notifications_on_completion: True,
    se_notifications_on_fail: True,
    se_notifications_on_error_drop_exceeds_threshold_breach: True, 
    se_notifications_on_error_drop_threshold: 15,
}
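This dictionary is later passed to the with_expectations decorator through its spark_conf argument (see the initialization example below), so the notification settings take effect for that run.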

Spark Expectations Initialization

For all the examples below, the following import and SparkExpectations class instantiation are mandatory.

  1. Instantiate the SparkExpectations class, which provides all the functions required for running data quality rules:

from spark_expectations.core.expectations import SparkExpectations


# product_id should match with the "product_id" in the rules table
se: SparkExpectations = SparkExpectations(product_id="your-product-id")

  2. Decorate the function that generates the source DataFrame with @se.with_expectations, passing the rules-table details and the global configuration defined above:

import os

from pyspark.sql import DataFrame, SparkSession

from spark_expectations.config.user_config import *

# Obtain (or create) the active Spark session used in the example below
spark = SparkSession.builder.getOrCreate()


@se.with_expectations( 
    se.reader.get_rules_from_table(
        product_rules_table="pilot_nonpub.dq.dq_rules",
        table_name="pilot_nonpub.dq_employee.employee",  
        dq_stats_table_name="pilot_nonpub.dq.dq_stats" 
    ),
    write_to_table=True, 
    write_to_temp_table=True,
    row_dq=True, 
    agg_dq={
        se_agg_dq: True,  
        se_source_agg_dq: True,  
        se_final_agg_dq: True, 
    },
    query_dq={ 
        se_query_dq: True,  
        se_source_query_dq: True, 
        se_final_query_dq: True, 
        se_target_table_view: "order", 
    },
    spark_conf=se_global_spark_Conf,

)
def build_new() -> DataFrame:
    _df_order: DataFrame = (
        spark.read.option("header", "true")
        .option("inferSchema", "true")
        .csv(os.path.join(os.path.dirname(__file__), "resources/order.csv"))
    )
    _df_order.createOrReplaceTempView("order")  

    _df_product: DataFrame = (
        spark.read.option("header", "true")
        .option("inferSchema", "true")
        .csv(os.path.join(os.path.dirname(__file__), "resources/product.csv"))
    )
    _df_product.createOrReplaceTempView("product") 

    _df_customer: DataFrame = (
        spark.read.option("header", "true")
        .option("inferSchema", "true")
        .csv(os.path.join(os.path.dirname(__file__), "resources/customer.csv"))
    )

    _df_customer.createOrReplaceTempView("customer") 

    return _df_order 
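Calling the decorated function runs the configured row, aggregation, and query data quality rules on the returned DataFrame; with write_to_table=True the validated data is written to the target table while failing records are quarantined in the error table. A minimal sketch of invoking it and checking the collected statistics is shown below; the stats table name is the dq_stats_table_name used in the decorator above.

# Running the decorated function executes the data quality rules
build_new()

# Inspect the run-level statistics recorded by Spark Expectations
spark.table("pilot_nonpub.dq.dq_stats").show(truncate=False)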
