visdesignlab / persist

Persist is a JupyterLab extension to enable persistent interactive visualizations in JupyterLab notebooks.

License: BSD 3-Clause "New" or "Revised" License

JavaScript 1.06% Python 34.34% TypeScript 64.13% CSS 0.06% Shell 0.40%
altair jupyterlab jupyterlab-extension provenance provenance-tracking vegalite

Persist

Persistent and Reusable Interactions in Computational Notebooks

Binder

This repository contains the source code for the Persist (PyPi) extension.

Persist is a JupyterLab extension to enable persistent interactive outputs in JupyterLab notebooks. Check out the introductory video below.

Persist.Introduction.mp4

Watch on YouTube with CC

Publication

Persist was developed as part of a publication that will appear at EuroVis 2024.

Teaser image from the pre-print. The figure shows a high-level overview of the Persist workflow.

Supplementary Material

Supplementary material, including example notebooks, walkthrough notebooks, notebooks used in the study (including participant notebooks), and the analysis notebooks, can be accessed here.

Abstract

Computational notebooks, such as Jupyter, support rich data visualization. However, even when visualizations in notebooks are interactive, they still are a dead end: interactive data manipulations, such as selections, applying labels, filters, categorizations, or fixes to column or cell values, can be applied efficiently in interactive visual components, but interactive components typically cannot manipulate Python data structures. Furthermore, actions performed in interactive plots are volatile, i.e., they are lost as soon as the cell is re-run, prohibiting reusability and reproducibility. To remedy this, we introduce Persist, a family of techniques to capture and apply interaction provenance to enable persistence of interactions. When interactions manipulate data, we make the transformed data available in dataframes that can be accessed in downstream code cells. We implement our approach as a JupyterLab extension that supports tracking interactions in Vega-Altair plots and in a data table view. Persist can re-execute the interaction provenance when a notebook or a cell is re-executed, enabling reproducibility and re-use.

We evaluated Persist in a user study targeting data manipulations with 11 participants skilled in Python and Pandas, comparing it to traditional code-based approaches. Participants were consistently faster with Persist, were able to correctly complete more tasks, and expressed a strong preference for Persist.

Persist and Vega-Altair charts

Persist works with Vega-Altair charts directly for the most part. Vega-Altair and Vega-Lite offer multiple ways to write a specification; however, Persist has certain requirements that must be fulfilled.

  • The selection parameters in the chart should be named. Vega-Altair's default behavior is to generate a selection parameter name with an auto-incremented numeric suffix, and the generated name keeps incrementing on subsequent re-executions of the cell. Persist relies on consistent names to replay interactions, so passing the name parameter explicitly allows Persist to work reliably.

  • Point selections should have at least the fields attribute specified. Vega-Altair supports selections without fields by using auto-generated indices, which follow the default row order of the source dataset. Using these indices directly can cause Persist to operate on incorrect rows if the source dataset order changes.

  • Dealing with datetime in Pandas is challenging. To standardize how datetime conversion takes place within Vega-Lite and within Pandas when using Vega-Altair, the TimeUnit transforms and encodings must be specified in UTC, e.g., month(Date) should be utcmonth(Date).

Requirements

- JupyterLab >= 4.0.0 or Jupyter Notebook >= 7.0.0
- pandas >= 0.25
- altair >= 5
- ipywidgets
- anywidget

Install

To install the extension, execute:

pip install persist_ext

If the Jupyter server is running, you might have to reload the browser page and restart the kernel.

Uninstall

To remove the extension, execute:

pip uninstall persist_ext

Contributing

Persist uses hatch to manage the development, build, and publish workflows. You can install hatch using pipx, pip, or Homebrew (on macOS or Linux).

pipx

Install hatch globally in an isolated environment. We recommend this approach.

pipx install hatch
pip

Install hatch in the current Python environment.

WARNING: This may change the system Python installation.

pip install hatch
Homebrew

brew install hatch

Jupyter extensions use jlpm, a custom version of the yarn package manager. When any relevant command is run, hatch should automatically install and set up jlpm. After installing hatch with your preferred method, follow the instructions below for the workflow you want. We prefix all commands with hatch run to ensure they run in the proper environments.

Development

Run the setup script from package.json:

hatch run jlpm setup

When setup is complete, open three terminal windows and run one of the following commands in each.

Widgets

Start the Vite dev server to build the widgets:

hatch run watch_widgets

Extension

Start the dev server to watch and build the extension:

hatch run watch_extension

Lab

Run the JupyterLab server with the minimize flag set to false, which gives better stack traces and a better debugging experience.

hatch run run_lab

Build

To build the extension as a standalone Python package, run:

hatch run build_extension

Publish

To publish the extension, first create a proper version. Run any of the following:

hatch version patch # x.x.1
hatch version minor # x.1.x
hatch version major # 1.x.x

You can also append release candidate label:

hatch version rc

Alternatively, you can specify the exact version directly:

hatch version "1.3.0"

Once the proper version is set, build the extension using the build workflow.

When the build is successful, you can publish the extension if you have proper authorization:

hatch publish

Acknowledgements

The widget architecture of Persist is built using the anywidget project.

The interactive visualizations used by Persist are based on the excellent Vega-Lite and Vega-Altair projects. Specifically, the implementation of the JupyterChart class in Vega-Altair was of great help in understanding how a Vega-Altair chart can be turned into a widget. We gratefully acknowledge funding from the National Science Foundation (IIS 1751238 and CNS 213756).

persist's People

Contributors

kirangadhave, zachcutler04


persist's Issues

Create new column by example

Once the intent library is integrated into the extension, we should explore the creation of new columns using the inferred intents.

Consider using just selection/ranges before that.

operation

Add sort operation

Sorting individual values might be useful for bar charts and other categorical visualizations.

Sorting by column names would be useful in repeat charts and scatterplot matrices (a scatterplot matrix is also a repeat chart).

operation

Aggregate

Write up how interactions should work

Indicate aggregate points

Add "label" and "notes" buttons / operations

Label

This operation will let us label the selection with a shortish string.

Notes

This operation will allow attaching of arbitrary notes to the selection.

Discussion:

Both operations will add new columns to the dataset: __label & __notes.

Should we support multiple labels/notes for a single row? For now no support for the label, maybe for notes.
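A minimal pandas sketch of what the two operations would do to the dataset. The boolean mask stands in for an interactive selection, and the label/note strings are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"value": [1, 5, 9, 12]})

# Boolean mask standing in for an interactive selection (assumption).
selected = df["value"] > 8

# Label: attach a shortish string to the selected rows.
df.loc[selected, "__label"] = "outlier"

# Notes: attach arbitrary free-form text to the selected rows.
df.loc[selected, "__notes"] = "values above expected range; re-check source"
```

Unselected rows keep NaN in both columns, so a single-label model falls out naturally; supporting multiple notes per row would require a different column representation.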

Generate code to access dataframe

Each node in the Trrack graph should have a pointer to a dataframe. The dataframes can be saved in Python dict with trrack node ids as keys and names formatted like df_{first 6 digits of trrack id}.

Clicking on the copy button should generate the dataframe in memory if it doesn't exist and copy a code snippet to retrieve the dataframe. E.g. IDE.get_df(<dataframe name>)

The dataframe should be usable in downstream code as usual. On kernel restart the dataframe won't exist in memory, so code using the variable will fail, and the cell with the Trrack interaction needs to be re-run. This is expected even in basic Python code.
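A rough sketch of the proposed mapping. The function names register_df and get_df and the registry dict are illustrative, not the actual extension API:

```python
import pandas as pd

# Hypothetical in-memory registry mapping generated names to dataframes.
_registry: dict = {}

def register_df(trrack_id: str, df: pd.DataFrame) -> str:
    """Store df under df_{first 6 characters of the Trrack node id}."""
    name = f"df_{trrack_id[:6]}"
    _registry[name] = df
    return name

def get_df(name: str) -> pd.DataFrame:
    # After a kernel restart the registry is empty, so this raises KeyError
    # until the cell with the Trrack interaction is re-run.
    return _registry[name]
```

The copy button would then only need to emit the snippet `get_df("df_...")` for the node under the cursor.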

Update

There are two ways to generate the dataframe. Create dynamic dataframe generates a dataframe whose variable is kept in sync with the current node in the provenance graph. After restarting the kernel, running the cell regenerates the variable.

Another way to create a dataframe is to use the copy button on the provenance graph; this dataframe is associated with that specific node.
You must visit the node at least once after a kernel restart.

Add error reporting during auto-reapply process

We treat each interaction in the provenance as a code snippet. The interactions are reapplied when the code cell is run. If the dataset variable changes, this might result in errors. Currently we have an errors/warnings attribute in the TrrackableCell class, but we don't utilize it.

  • Create proper error messages and store in cell metadata.
  • Associate the errors with correct provenance node.
  • Show the errors in the provenance graph.
  • Move current to the previous node.

Add drop column operation

We should be able to drop/delete a column from the dataset in repeat visualizations that show different columns in different charts, e.g., a SPLOM.

operation

Also the inverse of this operation: keep the selected columns and delete everything else.
operation

Add aggregate interaction

Aggregating selected points in the same chart is not supported by Vega-Lite/Vega-Altair.

This is important for interactively transforming the data. Clicking the aggregate button should create an aggregate transform for all active selections. This transform is added to the transforms array, along with overlay logic (TODO: expand) to display the aggregate. Finally, we should remove the selections.

For generating code at this node, we will load the dataframe, do a group by query, and show the grouped dataframe instead.

TODO:

  • How to determine aggregation op for each column?
  • Should the get_df function accept an arg for flattening? If true, we return the original data frame with a new column instead.
  • The new column should have values None or trrack_agg_#. get_df can take an arg to change the prefix when loading the dataframe in Python
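For the code-generation step, the group-by query could look roughly like this in pandas. The __agg column name and the per-column choice of aggregation op (mean for numeric, first otherwise) are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "__agg": ["agg_1", "agg_1", None],  # None = row not in any aggregate
    "price": [10.0, 20.0, 7.0],
    "kind": ["a", "b", "c"],
})

# Pick an aggregation op per column based on dtype:
# mean for numeric columns, first value for everything else.
ops = {c: ("mean" if pd.api.types.is_numeric_dtype(df[c]) else "first")
       for c in df.columns if c != "__agg"}

# Group only the rows that belong to an aggregate.
grouped = df[df["__agg"].notna()].groupby("__agg").agg(ops)
```

A flattening get_df variant would instead return the original dataframe with the __agg column kept as-is, leaving the group-by to the user.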

Give Aggregates meaningful default names

Currently, aggregates have a generated name in the format: df_short-trrack-uuid. Can we generate a meaningful default name?

It should also allow editing of the aggregate name later.

Persist the dataframes without extension

If the IDE extension is not loaded, the df_.... (see #8) variables fail.

There is no recovery from this failure unless the IDE extension is loaded. We can generate a pandas code snippet that recreates the original variable.

How we show the snippet will matter a lot. The straightforward way is to copy the snippet as a Python comment similar to the dataframe access code, like in #8.

A more ideal solution would be to store the generated code under a different mime-type in the cell metadata and show it when the cell execution throws an error due to an undefined variable.

Pre-pilot/Pilot

We can recruit from our lab. Jack has some experience with notebooks.

Testing compound charts

Test various chart types like in #33.

Chart type                | Selection | Filter | Aggregate
Single - Scatterplot      | P         | P      | P
Single - Line/Area chart  |           |        |
Single - Bar chart        | P         | P      | P
Single - SPLOM            |           |        |
Single - Heatmaps         | P         | P      | P
Layered - Scatterplot     | P         | P      | P
Layered - Line/Area chart |           |        |
Layered - Bar chart       | P         | P      | P
Layered - SPLOM           |           |        |
Layered - Heatmaps        | P         | P      | P
Repeat - Scatterplot      | P *       | P      | P
Repeat - Line/Area chart  |           |        |
Repeat - Bar chart        | P         | P      | P
Repeat - SPLOM            |           |        |
Repeat - Heatmaps         | P         | P      | P
Faceted - Scatterplot     | P         | P      | P
Faceted - Line/Area chart |           |        |
Faceted - Bar chart       | P         | P      | P
Faceted - SPLOM           |           |        |
Faceted - Heatmaps        | P         | P      | P
HConcat - Scatterplot     | P         | P      | P
HConcat - Line/Area chart |           |        |
HConcat - Bar chart       | P         | P      | P
HConcat - SPLOM           |           |        |
HConcat - Heatmaps        | P         | P      | P
VConcat - Scatterplot     | P         | P      |
VConcat - Line/Area chart | P         |        |
VConcat - Bar chart       | P         | P      | P
VConcat - SPLOM           |           |        |
VConcat - Heatmaps        | P         | P      | P

Add scale min/max value operation

Scale a numerical column between a minimum and maximum value.

Uses:
Clamping: add a new column with an annotation that contains the real value, but change the data value, e.g., (Clamped from 37).

Name the new column as <og_col_name>_scaled/clamped.

operation
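A pandas sketch of the two variants described above. The function names, the default [0, 1] range, and the `_scaled`/`_clamped`/`_annotation` column suffixes are illustrative:

```python
import pandas as pd

def scale_column(df: pd.DataFrame, col: str,
                 new_min: float = 0.0, new_max: float = 1.0) -> pd.DataFrame:
    """Linearly rescale a numeric column into [new_min, new_max]."""
    s = df[col]
    df[f"{col}_scaled"] = ((s - s.min()) / (s.max() - s.min())
                           * (new_max - new_min) + new_min)
    return df

def clamp_column(df: pd.DataFrame, col: str,
                 lo: float, hi: float) -> pd.DataFrame:
    """Clamp values, keeping the real value in an annotation column."""
    clamped = df[col].clip(lo, hi)
    # Annotate only rows whose value actually changed.
    df[f"{col}_annotation"] = df[col].where(
        df[col] == clamped,
        df[col].map(lambda v: f"(Clamped from {v})"))
    df[f"{col}_clamped"] = clamped
    return df
```

Both variants keep the original column untouched, which matches the general Persist pattern of adding derived columns rather than overwriting data.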

Add better aggregate controls

We have a single Aggregate button which aggregates with the mean operation. It would be useful to have multiple controls for options like: Aggregate Sum, Aggregate Mean, Aggregate Group-By.

Aggregate Group-by creates the group-by object without any operation specified. In terms of visualization, we can show this by a convex hull (maybe?) instead of adding the aggregate point and removing the original points.

Add label change operations.

We should have operations to change labels for rows or columns.

Row

Should we specify the label column? If there is no label column, should we add one?

Column

I assume this is the same as #55?

Study design

The minimum evaluation we want is user feedback on the usefulness of our techniques/plugins.
Good to have: a quantitative study.

Qualitative

Participants: ~5
Discuss potential candidates with Klaus.
Dr. Kogan has a class with students using notebooks (will they have their data?)

  • IRB
  • Brief interview about their existing notebook workflow
  • Discussion on pain points
  • Brief introduction to the technique
  • A short demo (maybe a video?)
  • Participants load one of their notebooks in Jupyter with IDE installed.
    OR
  • Start a new analysis with their data.
    OR
  • Task with an analysis to do with our data
  • (Re)analyze with trrack support + interactive visualizations.
  • Feedback interview
    • experience (learning curve, difficulty, etc.)?
    • were any pain points addressed
    • what did you like/dislike

#34 - for tasks

Quantitative

Participants: not sure
Crowdsourced on Prolific
Dr. Kogan's class (might be able to do in-person study)

  • IRB
  • Demo Video
  • Tasks
    • Perform analysis in two notebooks: regular jupyterlab & IDE-enhanced jupyterlab
    • Similar tasks with different dataset
    • Goal is to get time to completion and accuracy
  • Feedback Questionnaire

Add categorize operation

Currently, aggregate adds a new column called __aggregate to attach an aggregate name to each row. We should have an explicit Categorize operation that lets us add a __category column that can be named.

Only one category for the CHI prototype, but ideally should allow adding any number of category columns.
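A minimal pandas sketch of the proposed operation. The mask, the category value, and the column data are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"sales": [3, 14, 2, 21]})

# Boolean mask standing in for the current interactive selection.
selected = df["sales"] > 10

# Assign a user-provided category name to the selected rows.
df.loc[selected, "__category"] = "high_sales"
```

Supporting multiple category columns would simply mean letting the user name the column itself (e.g. __category_region) instead of hard-coding __category.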
