outerbounds / metaflow-card-notebook

Render Jupyter Notebooks With Metaflow Cards

License: Apache License 2.0

Python 51.54% Jupyter Notebook 34.48% Makefile 10.76% Shell 3.23%
metaflow metaflow-cards jupyter jupyter-notebook machine-learning mlops python ml visualization ml-infrastructure

metaflow-card-notebook's Introduction

metaflow-card-notebook

Use @card(type='notebook') to programmatically run & render notebooks in your flows.

Installation

pip install metaflow-card-notebook

Motivation

You may have seen this series of blog posts that have been written about Notebook Infrastructure at Netflix. Of particular interest is how notebooks are programmatically run, often in DAGs, to generate reports and dashboards:

  • Parameterized Execution of Notebooks
  • Notebooks in DAGs
  • Dependency Management & Scheduling

This way of generating reports and dashboards is very compelling, as it lets data scientists create content using environments and tools that they are familiar with. With @card(type='notebook') you can programmatically run and render notebooks as part of a DAG. This card allows you to accomplish the following:

  • Run notebook(s) programmatically in your Metaflow DAGs.
  • Access data from any step in your DAG so you can visualize it or otherwise use it to generate reports in a notebook.
  • Render your notebooks as reports or model cards that can be embedded in various apps.
  • Inject custom parameters into your notebook for execution.
  • Ensure that notebook outputs are reproducible.

Additionally, you can use all of the features of Metaflow to manage the execution of notebooks, for example:

  • Managing dependencies (ex: @conda)
  • Requesting compute (ex: @resources)
  • Parallel execution (ex: foreach)
  • etc.

Here is an example of a dashboard generated by a notebook card:

You can see real examples of flows that generate these dashboards in examples/.

Usage

Step 1: Prepare your notebook

The notebook card injects the following five variables into your notebook:

  1. run_id
  2. step_name
  3. task_id
  4. flow_name
  5. pathspec

You can use these variables to retrieve the data you need from a flow. It is recommended that the first cell in your notebook defines these variables and that you designate this cell with the tag "parameters".

For an example of this, see tests/nbflow.ipynb.

Note: in the example notebook these variables are set to None. However, for prototyping you can set them to real values taken from flows that have previously been executed.
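
As a minimal sketch of such a first cell (not a copy of tests/nbflow.ipynb), the cell below declares the five variables with None placeholders. It assumes the second variable is spelled step_name, as it is in Step 3, and that the cell carries the "parameters" tag in the notebook's cell metadata.

```python
# First cell of the notebook, tagged "parameters".
# Placeholders only: the notebook card injects real values at execution time.
run_id = None
step_name = None
task_id = None
flow_name = None
pathspec = None

# Printing the values makes the lineage easy to find in the rendered output.
print(f"flow={flow_name} run={run_id} step={step_name} task={task_id} pathspec={pathspec}")
```

While prototyping, the None values can be swapped temporarily for the identifiers of a run that has already finished.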

Step 2: Prepare your flow with the notebook card

You can render cards from notebooks using the @card(type='notebook') decorator on a step. For example, in tests/nbflow.py, the notebook tests/nbflow.ipynb is run and rendered programmatically:

from metaflow import step, current, FlowSpec, Parameter, card

class NBFlow(FlowSpec):

    exclude_nb_input = Parameter('exclude_nb_input', default=True, type=bool)

    @step
    def start(self):
        self.data_for_notebook = "I Will Print Myself From A Notebook"
        self.next(self.end)

    @card(type='notebook')
    @step
    def end(self):
        self.nb_options_dict = dict(input_path='nbflow.ipynb', exclude_input=self.exclude_nb_input)

if __name__ == '__main__':
    NBFlow()

Note how the start step stores some data that we want to access from a notebook later. We will discuss how to access this data from a notebook in the next step.

By default, a step that is decorated with @card(type='notebook') expects the variable nb_options_dict to be defined in the step. This variable is a dictionary of arguments that is passed to papermill.execute.execute_notebook. Only the input_path argument is required. If output_path is absent, it is automatically set to _rendered_<run_id>_<step_name>_<task_id>_<your_input_notebook_name>.ipynb.

Furthermore, exclude_input is an additional boolean argument that specifies whether to hide (True) or show (False) cell inputs in the rendered card. It is False by default.
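
To make the default output name concrete, here is a small sketch of the naming pattern described above. The helper default_output_name is hypothetical and is not part of the card's API. It models only the file name, assuming the notebook name is the input file name without its extension, and it says nothing about which directory the card writes to.

```python
from pathlib import Path


def default_output_name(input_path, run_id, step_name, task_id):
    """Hypothetical helper that mirrors the documented naming pattern."""
    # Use the notebook's file name without directory or extension.
    stem = Path(input_path).stem
    return f"_rendered_{run_id}_{step_name}_{task_id}_{stem}.ipynb"


example_name = default_output_name("nbflow.ipynb", "1679995435861952", "end", "2")
print(example_name)
```

Passing an explicit output_path in nb_options_dict replaces this default.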

Step 3: Prototype the rest of your notebook

Recall that run_id, step_name, task_id, flow_name and pathspec are injected into the notebook. We can use these values in a notebook with Metaflow's utilities for inspecting Flows and Results. We demonstrate this in tests/nbflow.ipynb:

Some notes about this notebook:

  • We recommend printing the variables injected into the notebook. This can help with debugging and provides an easy-to-locate lineage.
  • We demonstrate how to access your flow's data via a Step or a Task object. You can read more about the relationship between these objects in these docs. In short, a Task is a child of a Step because a Step can have many tasks (for example if you use a foreach construct for parallelism).
  • We recommend executing a run manually and prototyping the notebook by temporarily supplying the run_id, flow_name, etc. to achieve the desired result.
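
The sketch below illustrates one way the injected values could be turned into Metaflow pathspecs and used to read an artifact. It is not the content of tests/nbflow.ipynb. The helper names are hypothetical, and read_artifact assumes that Metaflow is installed and that the run already exists in your datastore.

```python
def step_pathspec(flow_name, run_id, step_name):
    """Build a step pathspec of the form FlowName/run_id/step_name."""
    return f"{flow_name}/{run_id}/{step_name}"


def task_pathspec(flow_name, run_id, step_name, task_id):
    """Build a task pathspec of the form FlowName/run_id/step_name/task_id."""
    return f"{step_pathspec(flow_name, run_id, step_name)}/{task_id}"


def read_artifact(flow_name, run_id, step_name, artifact_name):
    """Read one artifact from a finished step (hypothetical helper)."""
    # Imported lazily so the pathspec helpers work without Metaflow installed.
    from metaflow import Step

    step = Step(step_pathspec(flow_name, run_id, step_name))
    # step.task is a Task of the step; its data attribute exposes the artifacts.
    return getattr(step.task.data, artifact_name)


print(task_pathspec("NBFlow", "1679995435861952", "end", "2"))
```

In the example flow, read_artifact(flow_name, run_id, "start", "data_for_notebook") would be the call that retrieves the string stored by the start step.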

Step 4: Test the card

To test the card in the example outlined above, you must first run the flow (the parentheses allow the commands to run in a subshell):

(cd tests && python nbflow.py run)

Then, render the card:

(cd tests && python nbflow.py card view end)

By default, the cell inputs are hidden when the card in this example is rendered. For learning purposes, it can be useful to render the card with the inputs to validate how the card is executed. You can do this by setting the exclude_nb_input parameter defined in the flow to False:

(cd tests && python nbflow.py run --exclude_nb_input=False && python nbflow.py card view end)

Customized Rendering

The @card(type='notebook') decorator is an opinionated way to execute and render notebooks: it requires significantly less code, at the cost of some flexibility. While some customization is possible by passing the appropriate arguments to nb_options_dict as listed in papermill.execute.execute_notebook, you can achieve more fine-grained control by executing and rendering the notebook yourself and using the html card. We show an example of this in examples/deep_learning/dl_flow.py:

    @card(type='html')
    @step
    def nb_manual(self):
        """
        Run & Render Jupyter Notebook Manually With The HTML Card.

        Using the html card provides you greater control over notebook execution and rendering.
        """
        import papermill as pm
        output_nb_path = 'notebooks/rendered_Evaluate.ipynb'
        output_html_path = output_nb_path.replace('.ipynb', '.html')

        pm.execute_notebook('notebooks/Evaluate.ipynb',
                            output_nb_path,
                            parameters=dict(run_id=current.run_id,
                                             flow_name=current.flow_name,)
                             )
        run(f'jupyter nbconvert --to html --no-input --no-prompt {output_nb_path}')
        with open(output_html_path, 'r') as f:
            self.html = f.read()
        self.next(self.end)

You can run the following command in your terminal to see the output of this step (it may take several minutes):

(cd examples/deep_learning && python dl_flow.py run && python dl_flow.py card view nb_manual)
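
The nb_manual step calls a run helper whose import is not shown in the excerpt. As a sketch only, the rendering half of that step could be written with the standard library's subprocess.run instead. This is a stand-in, not necessarily what dl_flow.py uses, and the names nbconvert_command and render_to_html are hypothetical.

```python
import subprocess


def nbconvert_command(notebook_path, hide_input=True):
    """Build the nbconvert command as an argument list (no shell involved)."""
    cmd = ["jupyter", "nbconvert", "--to", "html"]
    if hide_input:
        # Same flags as in the step above: drop cell inputs and prompts.
        cmd += ["--no-input", "--no-prompt"]
    cmd.append(notebook_path)
    return cmd


def render_to_html(notebook_path):
    """Convert an executed notebook to HTML and return the HTML text."""
    subprocess.run(nbconvert_command(notebook_path), check=True)
    html_path = notebook_path.replace(".ipynb", ".html")
    with open(html_path, "r") as f:
        return f.read()


print(nbconvert_command("notebooks/rendered_Evaluate.ipynb"))
```

Inside a step decorated with @card(type='html'), the returned string would be assigned to self.html, as in the example above.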

Common Issues

Papermill Arguments

Many issues can be resolved by providing the right arguments to papermill.execute.execute_notebook. Below are some common issues and examples of how to resolve them:

  1. Kernel Name: The name of the python kernel you use locally may be different from your remote execution environment. By default, papermill will attempt to find a kernel name in the metadata of your notebook, which is often automatically created when you select a kernel while running a notebook. You can use the kernel_name argument to specify a kernel. Below is an example:
    @card(type='notebook')
    @step
    def end(self):
        self.nb_options_dict = dict(input_path='nbflow.ipynb', kernel_name='Python3')
  2. Working Directory: The working directory may be important when your notebook is executed, especially if your notebooks rely on certain files or other assets. You can set the working directory the notebook is executed in with the cwd argument, for example, to set the working directory to data/:
    @card(type='notebook')
    @step
    def end(self):
        self.nb_options_dict = dict(input_path='nbflow.ipynb', cwd='data/')
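
When debugging a kernel mismatch, it can help to check which kernel name is recorded in the notebook file itself. The sketch below reads it from the notebook's JSON metadata (metadata.kernelspec.name). The helper notebook_kernel_name is hypothetical, and the demo notebook is a throwaway file created only for illustration.

```python
import json
import os
import tempfile


def notebook_kernel_name(notebook_path):
    """Return the kernel name stored in a notebook's metadata, or None."""
    with open(notebook_path, "r", encoding="utf-8") as f:
        nb = json.load(f)
    return nb.get("metadata", {}).get("kernelspec", {}).get("name")


# Demo with a minimal, empty notebook written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.ipynb")
    demo_nb = {
        "cells": [],
        "metadata": {"kernelspec": {"name": "python3", "display_name": "Python 3"}},
        "nbformat": 4,
        "nbformat_minor": 5,
    }
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump(demo_nb, f)
    demo_kernel = notebook_kernel_name(demo_path)

print(demo_kernel)
```

If the recorded name does not exist in the remote environment, passing kernel_name in nb_options_dict overrides it.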

Remote Execution

Dependency Management

If you are running your flow remotely, for example with @batch, you must remember to include the dependencies for this notebook card itself! One way to do this is using pip as illustrated below:

    @card(type='notebook')
    @step
    def end(self):
        import os, sys
        os.system(f"{sys.executable} -m pip install 'ipykernel>=6.4.1' 'papermill>=2.3.3' 'nbconvert>=6.4.1' 'nbformat>=5.1.3'")
        self.nb_options_dict = dict(input_path='nbflow.ipynb')

Note: You can omit the pip install step above if your target environment already includes all the dependencies listed in settings.ini. If you do omit pip install, make sure that the versions pinned in that environment are the correct ones.
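
As an alternative sketch, the install can be run without a shell by passing pip its arguments as a list. This avoids the shell treating the > in a version specifier such as ipykernel>=6.4.1 as output redirection. The names below are hypothetical, and the version pins are the ones quoted above.

```python
import subprocess
import sys

# Version pins copied from the snippet above.
NOTEBOOK_CARD_REQUIREMENTS = [
    "ipykernel>=6.4.1",
    "papermill>=2.3.3",
    "nbconvert>=6.4.1",
    "nbformat>=5.1.3",
]


def pip_install_command(requirements):
    """Build a pip install command for the current Python interpreter."""
    return [sys.executable, "-m", "pip", "install", *requirements]


def install_requirements(requirements):
    """Run pip and raise CalledProcessError if the install fails."""
    subprocess.check_call(pip_install_command(requirements))


print(pip_install_command(NOTEBOOK_CARD_REQUIREMENTS))
```

Calling install_requirements(NOTEBOOK_CARD_REQUIREMENTS) at the top of the step, before nb_options_dict is set, would take the place of the os.system call.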

Including Notebook Files In The Context

If you are running steps remotely, you must ensure that your notebooks are uploaded to the remote environment with the CLI argument --package-suffixes=".ipynb". For example, to execute examples/deep_learning/dl_flow.py with this argument:

(cd examples/deep_learning && python dl_flow.py --package-suffixes=".ipynb" run)

Examples

We provide several examples of flows that contain the notebook card in examples/.

metaflow-card-notebook's People

Contributors

hamelsmu, hugobowne, obgibson

metaflow-card-notebook's Issues

Notebooks in examples?

I've just done the first example here and it works super smoothly and the output looks great.

It isn't obvious where notebooks have been used at all and that seems odd, unless I'm missing something?

I suppose from the example it isn't clear why they are called "notebook" cards?

Notebook Key Interrupted Error

Hello

This is my simple flow.
I want to run my notebook. The first 5-6 cells work correctly,
but the last few cells always raise a KeyboardInterrupt error.
My flow doesn't throw any exception,
but when I open [_rendered_1679995435861952_end_2_model_examle.ipynb] I see the error.

What could be wrong here?

Flow Output

2023-03-28 12:23:55.865 Workflow starting (run-id 1679995435861952):
2023-03-28 12:23:55.875 [1679995435861952/start/1 (pid 8060)] Task is starting.
2023-03-28 12:23:56.722 [1679995435861952/start/1 (pid 8060)] Start
2023-03-28 12:23:56.861 [1679995435861952/start/1 (pid 8060)] Task finished successfully.
2023-03-28 12:23:56.871 [1679995435861952/end/2 (pid 8161)] Task is starting.
2023-03-28 12:24:45.329 [1679995435861952/end/2 (pid 8161)] Task finished successfully.
2023-03-28 12:24:45.330 Done!

My Flow

from metaflow import FlowSpec, step, current, Parameter, card, resources


class LinearFlow(FlowSpec):
    
    @step
    def start(self):
        print("Start")
        self.next(self.end)
    
    @card(type='notebook')
    @step
    def end(self):
        self.nb_options_dict = dict(input_path='../notebooks/model_examle.ipynb')


if __name__ == '__main__':
    LinearFlow()

Mistake in README.md

The readme section Dependency Management contains a mistake. Line 5 of the code snippet should be

os.system(f"{sys.executable} -m pip ipykernel>=6.4.1 papermill>=2.3.3 nbconvert>=6.4.1 nbformat>=5.1.3")
self.nb_options_dict = dict(input_path='nbflow.ipynb')
