Functional testing for Databricks notebooks

To test a notebook, it first needs to be testable. A simple way to achieve this is to pass parameters that define the notebook's input and output. A separate notebook can then be used to execute the test cases and perform the assertions.

The sample use-case project (https://github.com/ivangeorgiev/databricks_functional_testing) illustrates this approach with two notebooks:

  • The load_movies notebook loads movie and rating data stored in JSON format, enriches the movie data (MovieLens) with rating aggregations and writes the result into a Spark table.
  • The test_load_movies notebook creates pre-defined data for the test case, then calls the load_movies notebook and performs assertions on the loaded table.

load_movies Notebook

The input and output of the notebook are defined via notebook parameters (widgets).

default_root = "dbfs:/FileStore/movielens/test-data"
dbutils.widgets.text("movie_data", 
                     defaultValue=f"{default_root}/movies.json", 
                     label="Movie source path")
dbutils.widgets.text("ratings_data", 
                     defaultValue=f"{default_root}/ratings.json", 
                     label="Ratings source path")
dbutils.widgets.text("target_table",
                     defaultValue="test_load_movies",
                     label="Output movie table name")

Setting the parameters' default values to actual development data makes the development process easier. It is convenient to store development and test data in DBFS. In production the location could be a remote URL, for example Azure Blob Storage or ADLS Gen2.
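
Inside the notebook the parameter values are read with dbutils.widgets.get and used as the input and output locations. The following is a minimal sketch of what the load logic could look like; the column names (movieId, rating) and the aggregations are assumptions based on the MovieLens schema, not the exact implementation of the sample notebook.

# spark and dbutils are provided by the Databricks notebook runtime
from pyspark.sql import functions as F

movie_data = dbutils.widgets.get("movie_data")
ratings_data = dbutils.widgets.get("ratings_data")
target_table = dbutils.widgets.get("target_table")

# Read the JSON sources
movies_df = spark.read.json(movie_data)
ratings_df = spark.read.json(ratings_data)

# Aggregate the ratings per movie (assumed columns: movieId, rating)
rating_aggs = (ratings_df
               .groupBy("movieId")
               .agg(F.avg("rating").alias("avg_rating"),
                    F.count("rating").alias("rating_count")))

# Enrich the movies with the rating aggregations and write the result
(movies_df
 .join(rating_aggs, on="movieId", how="left")
 .write.mode("overwrite")
 .saveAsTable(target_table))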

test_load_movies Notebook

To implement a proper test case, the Python unittest module is used.

import unittest

class TestLoadMovies(unittest.TestCase):
  # Define test case constants
  movies_path = "dbfs:/FileStore/movielens/test-data/movies.json"
  ratings_path = "dbfs:/FileStore/movielens/test-data/ratings.json"
  target_table = "test_load_movies"
  
  # Define input fixtures
  fixture_movies_json = """..."""
  fixture_ratings_json = """..."""
    
  # Define helpers
  def get_loaded_movies(self, movies_table):
    # ...
  
  # The setUp method is called before each test.
  # Use it to setup sandbox.
  def setUp(self):
    # ...
  
  # Define helpers for setting up test data 
  def given_test_data(self):
    # ...

  # Stimulus (actions) can be defined as methods.
  # This is the place where we run the notebook.
  def when_load_movies(self):
    # ...
  
  # All methods starting with "test" are tests, executed by unittest.
  # Use the test method to build your test scenario:
  #   GIVEN ... (setup)
  #   WHEN  ... (action or stimulus)
  #   THEN  ... (validation or assert)
  def test_movies_loaded(self):
    # ...
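
For illustration, the action and test methods could look like the sketch below (inside the TestLoadMovies class). The notebook path, the timeout and the assertions are assumptions made for the example, not the exact code from the sample project.

  def when_load_movies(self):
    # Run the notebook under test, passing the fixture locations as parameters.
    # Signature: dbutils.notebook.run(path, timeout_seconds, arguments)
    dbutils.notebook.run("./load_movies", 600, {
        "movie_data": self.movies_path,
        "ratings_data": self.ratings_path,
        "target_table": self.target_table,
    })

  def test_movies_loaded(self):
    # GIVEN pre-defined movie and rating data
    self.given_test_data()
    # WHEN the load_movies notebook is executed
    self.when_load_movies()
    # THEN the target table contains the enriched movies
    loaded_movies = self.get_loaded_movies(self.target_table)
    self.assertTrue(loaded_movies.count() > 0)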

unittest provides a simple execution approach that is compatible with Databricks notebooks.

# Discover the tests in TestLoadMovies and run them with verbose console output
suite = unittest.TestLoader().loadTestsFromTestCase(TestLoadMovies)
runner = unittest.TextTestRunner(verbosity=2)
runner.run(suite)

Functional test notebooks can be integrated into CI/CD pipelines and executed automatically.
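
For example, a pipeline step can trigger a one-time run of the test notebook through the Databricks Jobs API (runs/submit in Jobs API 2.1). The sketch below is one possible shape of such a step; the host, token, cluster id and notebook path are placeholders supplied by the pipeline environment.

import requests

DATABRICKS_HOST = "https://<your-workspace>.azuredatabricks.net"  # placeholder
DATABRICKS_TOKEN = "<personal-access-token>"                      # placeholder

payload = {
    "run_name": "functional-test-load-movies",
    "existing_cluster_id": "<cluster-id>",  # placeholder
    "notebook_task": {
        "notebook_path": "/Repos/<user>/databricks_functional_testing/test_load_movies",
    },
}

# Submit a one-time run of the test notebook and report the run id
response = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/runs/submit",
    headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
    json=payload,
)
response.raise_for_status()
print("Submitted test run:", response.json()["run_id"])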

How and when to use?

The illustrated approach can be used efficiently for functional testing of notebooks. Notebooks themselves are not a great way to engineer software. For big projects it is recommended to follow good software development practices: package your code in modules and execute them directly as Databricks jobs. The modules should be unit tested using frameworks like pytest, unittest, etc.

Notebooks can still be used, but they should be short, with the sole responsibility of orchestrating the data flow - gluing the boundaries between input, processing and output. Such notebooks can be tested with the described approach.

What's next?

Almost every test scenario requires verifying that the actually produced dataset is equal to the expected dataset. A useful technique is a helper that tests whether two Spark DataFrames are equal.
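
A minimal sketch of such a helper (the name assert_dataframes_equal is hypothetical), assuming both DataFrames share the same schema and that row multiplicity matters, can use exceptAll to check that no rows are missing from either side:

def assert_dataframes_equal(actual_df, expected_df):
  # Same columns in the same order
  assert actual_df.columns == expected_df.columns, "Column mismatch"
  # exceptAll keeps duplicate rows, so row multiplicity is compared as well
  assert actual_df.exceptAll(expected_df).count() == 0, "Unexpected rows in the actual DataFrame"
  assert expected_df.exceptAll(actual_df).count() == 0, "Rows missing from the actual DataFrame"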
