jupyterlab / jupyterlab-data-explorer
First class datasets in JupyterLab
License: BSD 3-Clause "New" or "Revised" License
Currently, we have to rely on manually parsing and creating different types of URLs which represent different dataset locations, such as the notebook cell URL. It would be nice if we could just write a format string for that URL and get a way to both generate notebook URLs and extract the data from them. Currently, we have to do something like this instead:
jupyterlab-data-explorer/dataregistry-extension/src/notebooks.ts
Lines 203 to 213 in a39a8af
This is error-prone and requires duplicating code.
Luckily, there is a standard just for this use case: URI Templates (RFC 6570)!
We should add support for this, either using an existing URI template library or writing our own. Ones that look like they might work are:
This is similar to how we created a type-safe abstraction over different mimetypes, some of them with arguments:
jupyterlab-data-explorer/dataregistry/src/datatypes.ts
Lines 42 to 66 in a39a8af
It lets us define a mimetype once, like this:
jupyterlab-data-explorer/dataregistry-extension/src/notebooks.ts
Lines 65 to 67 in a39a8af
and use it in converters to go to/from that mimetype:
jupyterlab-data-explorer/dataregistry-extension/src/notebooks.ts
Lines 75 to 97 in a39a8af
In a similar fashion, we should be able to create an object that refers to a certain URL template once, and then use it in converters. So we could add optional fromURL and toURL parameters to createConverter that take a URL template, so that instead of getting/returning an actual URL, you just get/return the parameters extracted from the template.
So the URLTemplate type that you pass in would have to hold both the string of the URL template and some types that specify the mapping from params to types, so probably an object. So possibly something like this:
const notebookTemplate = new TemplateURL<"path" | "cellID">(
  'file://{/path}.ipynb#/cells/{cellID}',
)
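As a very rough sketch of what such a TemplateURL helper might do (this class does not exist in the codebase, and a real implementation would follow RFC 6570 operator semantics and escape the literal parts of the template instead of the naive string handling below):

class TemplateURL<K extends string> {
  constructor(readonly template: string) {}

  // Fill the template parameters in, e.g. { path: "/a/b", cellID: "2" }.
  // A real implementation would also percent-encode values per RFC 6570.
  expand(params: Record<K, string>): string {
    return this.template.replace(/\{\/?([^}]+)\}/g, (_match, name: string) =>
      params[name as K]
    );
  }

  // Extract the parameters back out of a concrete URL, or return null if the
  // URL does not match the template.
  extract(url: string): Record<K, string> | null {
    const names: string[] = [];
    const pattern = this.template.replace(/\{\/?([^}]+)\}/g, (_match, name: string) => {
      names.push(name);
      return "(.+?)";
    });
    const match = url.match(new RegExp(`^${pattern}$`));
    if (!match) {
      return null;
    }
    const result: Record<string, string> = {};
    names.forEach((name, i) => {
      result[name] = match[i + 1];
    });
    return result as Record<K, string>;
  }
}

With the notebookTemplate above, extract("file:///a/b.ipynb#/cells/2") would give { path: "/a/b", cellID: "2" }, and expand would go the other way.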
It should be possible for extension authors to write a data converter to a widget and have this automatically registered as a mimerenderer.
If we output a vega spec in a notebook that refers to another cell, the URL used to reference it will be relative to the notebook containing the vega spec, like ./other-data. When we render the vega spec, we should take its current URL and resolve the data URLs in it relative to that.
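For plain relative paths, the standard WHATWG URL constructor already does this kind of resolution; for example (the notebook path here is made up, and references to other cells via fragments would need extra handling):

// Resolve a relative data URL against the URL of the spec that referenced it.
const specURL = "file:///notebooks/analysis.ipynb#/cells/2";
const dataURL = new URL("./other-data", specURL).href;
// dataURL === "file:///notebooks/other-data"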
We have added a preliminary debugger to see the state of the data registry. We should add a way to see the value of a piece of data using https://github.com/storybookjs/react-inspector and introspecting the observable
This is a standing issue to prototype, discuss, and review the user interface and user experience design of the data explorer.
I wanted to write down some thought on possible ways Quilt Data could integrate with the data registry, to let JupyterLab users explore Quilt Data more easily.
The basic entity is a package, like this: https://quiltdata.com/package/uciml/iris. It is in the form <user>/<name>. As a URL, it could look like quilt://quiltdata.com/uciml/iris. So if a user could add a dataset with that, what would they want to do with it?
If we create a vega spec with a data registry URL in it, the vega embed should resolve this using the data registry.
We should add output metadata to the data registry
We should be able to name the outputs of cells in a notebook, so that we can refer to them by their name. For example, we should be able to output a dataframe with a name, then refer to it by that name in a vega output later on.
Probably depends on #16 first
We should extend our existing notebook example to also show how this could work with a Python API that saves the dataframe as a file and exports that URL.
To reproduce:
@jupyterlab/metadata-extension
@tonyfast brought up an idea today about how to view metadata about the cell you are looking at. What I understand is that a user would run an output in a cell that changes the output metadata to put some JSON-LD in there about the cell. Then, they bring up the linked data browser (jupyterlab/jupyterlab-metadata-service#27 (comment)) and as they scroll through their cells they see the metadata about that cell.
One possible way to achieve this:
Try out separating converters and types into different files
xref: jupyterlab-contrib/jupyterlab-variableInspector#95
The Variable Inspector extension already has the capability to view different types of variables such as arrays and dataframes. So as not to duplicate effort, it would be useful if that extension could leverage the capabilities provided here - e.g. it could have a "View in Data Explorer" context menu item which could be hardwired to the double-click action.
We had an in-person meeting with contributors from NYU and Project Jupyter to define the User Needs for the Data Explorer project. Please keep in mind, it's still a work in progress, and I'll be updating this thread as we have more finalized, but I wanted to open this up for wider input in the meantime. The working document is available at: Dataset Explorer: User Needs Analysis
Users who will often use the Data Explorer:
Joe Data Scientist (PI) - Has domain expertise, writes a lot of code.
Jane Data Set Administrator - Works with her developers to register correct datasets along with relevant metadata/comments/code snippets.
Mike Business Intelligence Analyst - Does similar tasks to the Data Scientist but doesn't write as much code.
This issue provides an overview of the roadmap of the data explorer.
The JupyterLab data registry will enable extensions to 1) register abstract datasets with a central service and 2) monitor the registry for datasets. The dataset abstraction in the data registry includes:
The data registry also includes a converter architecture that can convert datasets from one MIME type to another.
Conceptually, the data registry will make datasets a first class entity or noun in JupyterLab.
The Data Explorer is a proposed user interface to enable users to explore datasets that different extensions have registered with the registry, and then do interesting things with those datasets, such as:
Conceptually, the data explorer UI will provide a user interface for the verbs related to a dataset, or the actions and activities a user can perform with the dataset, such as "render this as a table".
The visual representation of the list of datasets, and the things you can do with them is still a core design question.
If we create a notebook model context first using the data registry, and then open the notebook, this seems to fail to pick up the kernel.
We should add a reference to https://github.com/telamonian/jupyterlab-hdf in the readme
@hoo761 has been doing some work tracking when cells are moved around in https://github.com/jupyterlab/jupyterlab-commenting. Ideally, if you are looking at the output for a cell, and someone moves that cell, then your view should be renamed as well to reflect the new cell number.
To do this, we need to add a concept of renaming to the data registry, to register when one URL should be moved to another and make sure the views handle this properly.
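One possible shape for this, sketched with an rxjs stream of rename events that views could subscribe to (none of these names exist in the registry today; they only illustrate the idea):

import { Subject } from "rxjs";

// A rename maps an old dataset URL to its new URL.
interface RenameEvent {
  from: string;
  to: string;
}

const renames$ = new Subject<RenameEvent>();

// A view tracking a dataset URL updates itself when that URL is renamed.
function trackURL(initial: string, onChange: (url: string) => void): void {
  let current = initial;
  renames$.subscribe(({ from, to }) => {
    if (from === current) {
      current = to;
      onChange(to);
    }
  });
}

// e.g. when jupyterlab-commenting detects that cell 3 moved to position 5:
// renames$.next({ from: "file://a.ipynb#/cells/3", to: "file://a.ipynb#/cells/5" });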
Currently if you reload when the debugger is open, it isn't always restored.
Currently, re-running a cell with an output will give an error.
We should rewrite the voyager extension to use the conversion system. This will simplify its data ingestion logic.
The two packages here are ready for an initial release on NPM. I believe @ellisonbg needs to do the honors, since I am not part of the jupyterlab npm org.
Once we close a view, the widget is disposed and we cannot re-add it. We should change how the widget views work so that they are either not disposed when they close, or another is created.
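A sketch of the factory-based alternative (creating a fresh widget each time instead of caching one that gets disposed on close); the names and the Lumino import are illustrative only, not the extension's actual wiring:

import { Widget } from "@lumino/widgets";

// Keep a factory per dataset URL rather than a single widget instance.
const factories = new Map<string, () => Widget>();

function openDataset(url: string): Widget {
  const make = factories.get(url);
  if (!make) {
    throw new Error(`No widget factory registered for ${url}`);
  }
  // A new, not-yet-disposed widget every time the view is opened.
  return make();
}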
Our current model for the data registry is a global set of datasets that anyone can add datasets to and query for all the available datasets (pretty much what is in this issue #3).
There are a few new ideas floating around that would potentially change this fundamental layout:
First, I wanted to lay out my current conceptual models to let them guide this change in API:
file:///a/ is a "parent" of file:///a/b.csv. Also, both file:///a/ and file:///b/ share the same protocol of "file". We could make use of these semantics in the UI, to turn a flat list of URLs into a nested interactive tree view of datasets. Separately, a dataset could expose a MIME type (something like xxx.jupyter.datasets) that returns a list of "children" datasets.
These conceptual models could be conflicting. Do you want the "tree view" of datasets to be determined by URL or by child/parent status? Basically, I am asking if the child of a dataset has to be related to its parent as a subpath in the URL.
Let's take a step back from these conceptual mind games and make a list of different use cases we are trying to target with these concepts:
If you have a datasets.yml file that contains references to other datasets, you should be able to explore all of these in the data explorer.
Already in these use cases, I have changed the fundamental framing of how datasets end up in the data explorer. Instead of having an imperative style "register/publish" function to add a dataset, there are different "providers" (we can call them) that users can explore and find datasets within. It is moving from a push (user adds datasets explicitly) to a pull (data registry queries provider to see what datasets are available) model.
This is nice, because then we move the state management into these providers. They can figure out when/how to deregister datasets and can implement whatever algorithms they want to list all the available datasets.
A few questions remain:
To answer these, I think it would be helpful to sketch out some different possible user experiences for the data registry with nested data / data providers that cover the use cases we care about. This will inform what API makes sense to require from a provider and how providers relate to URLs. For example, if we have a nested structure that allows users to fold/unfold at parts of the URL path, then the provider should have something like queryChildren(basePath: string): Datasets. If we need pagination, then this response should also be paginated.
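A hypothetical provider interface along those lines might look like this (Dataset, DatasetProvider, and queryChildren are made-up names, not existing API):

// What the registry needs to know about a dataset a provider exposes.
interface Dataset {
  url: string;
  mimeTypes: string[];
}

// A provider the registry can pull from instead of being pushed into.
interface DatasetProvider {
  // List the datasets directly "under" a base URL, e.g. "file:///a/".
  queryChildren(basePath: string): Promise<Dataset[]>;
}

// A trivial provider, e.g. backed by a parsed datasets.yml file.
class StaticProvider implements DatasetProvider {
  constructor(private datasets: Dataset[]) {}

  async queryChildren(basePath: string): Promise<Dataset[]> {
    return this.datasets.filter((d) => d.url.startsWith(basePath));
  }
}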
From a notebook we should be able to output a link to a local file path and have that show up in the data registry.
This is useful if you want to save a dataframe as a CSV file then load it on the frontend.
Before the first release, we should do some documentation. It should focus on three possible users:
Currently I am thinking this can just go in markdown.
Intake is a "lightweight package for finding, investigating, loading and disseminating data." It would be nice to figure out how the JupyterLab data registry could integrate with this package.
Having JupyterLab be aware of Intake's "Data catalogs" is probably a good place to start. They "provide an abstraction that allows you to externally define, and optionally share, descriptions of datasets, called catalog entries."
For example, if you have a catalog as a file on disk in a catalog.yaml file, we might want to be able to see the datasets it defines in the data registry. This is similar to how, currently, if you have a .ipynb file, you can view the datasets in its cell outputs. To do this, we would have to be able to parse its YAML format in JavaScript and map the different entries to URLs.
For example, this catalog.yml file:
metadata:
  version: 1
sources:
  example:
    description: test
    driver: random
    args: {}

  entry1_full:
    description: entry1 full
    metadata:
      foo: 'bar'
      bar: [1, 2, 3]
    driver: csv
    args: # passed to the open() method
      urlpath: '{{ CATALOG_DIR }}/entry1_*.csv'

  entry1_part:
    description: entry1 part
    parameters: # User parameters
      part:
        description: section of the data
        type: str
        default: "stable"
        allowed: ["latest", "stable"]
    driver: csv
    args:
      urlpath: '{{ CATALOG_DIR }}/entry1_{{ part }}.csv'
Might map to a number of nested URLs:
./dataset.yml#/sources/example
./dataset.yml#/sources/entry1_full
./dataset.yml#/sources/entry1_part
And the ones that point to CSV files would also point to some nested URLs; for example, ./dataset.yml#/sources/entry1_part would point to:
./entry1_latest.csv
./entry1_stable.csv
This basically requires re-implementing the logic of all the drivers, so that they can work client-side.
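A rough sketch of just the catalog-to-URL step described above (leaving all driver logic out), assuming the js-yaml package is available on the frontend:

import { load } from "js-yaml";

interface IntakeCatalog {
  sources?: Record<string, unknown>;
}

// Map each source in an Intake catalog to a nested URL of the form
// <catalogURL>#/sources/<name>, e.g. ./catalog.yml#/sources/entry1_full.
function catalogEntryURLs(yamlText: string, catalogURL: string): string[] {
  const catalog = load(yamlText) as IntakeCatalog;
  return Object.keys(catalog.sources ?? {}).map(
    (name) => `${catalogURL}#/sources/${name}`
  );
}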
We could also support loading a remote Intake data catalog. If you loaded a URL like intake://catalog1:5000 in the data registry, you would want to be able to see the datasets available. Here, the proxy mode might be useful:
Proxied access: In this mode, the catalog server uses its local drivers to open the data source and stream the data over the network to the client. The client does not need any special drivers to read the data, and can read data from files and data servers that it cannot access, as long as the catalog server has the required access.
If we implement a client API for this server protocol, then we can let it handle all the data parsing and just expose the results it returns to the user. We would have to look more in depth at its specification.
I'm looking into whether I can build on this extension in order to implement a long-held ambition: an HDF5 file viewer for Jlab.
An HDF5 file is kind of like its own mini filesystem: there's a tree of groups (equivalent to directories), and each group may contain datasets and/or other groups. My basic idea would be to expose the group tree in the dataset browser, and then be able to open/view a given dataset (assuming it's 2D or less) as a grid in the main area.
Along those lines, I have a bunch of questions:
dataregistry-extension/src/files.tsx. Would anything else be required?
This is an issue to track the initial repository setup:
Drag photos into comments on this issue to upload them and get a URL. Then you can link to them from the docs/readme
@tonyfast mentioned it would be useful to register a GraphQL viewer for GraphQL files.
At the meeting today @ellisonbg and @tonyfast mentioned that it might be nice to be able to actually see the graph of the mimetypes that is generated during the conversion process.
To do this, we would need to save some extra context.
We should investigate filtering and searching, both how these things should be done on the data registry level and how they should be displayed in the UX.
We need to be careful how specific we get with these interfaces, to both support a wide set of use cases and keep the UI manageable.
@ellisonbg mentioned we might want to provide some built-in filtering/searching just for in-memory data, like possibly everything we show in the UX. And then more advanced filtering/searching could be implemented in an alternative UI if that is desired.
So in the default UI, maybe just show "activities" and nesting, and allow searching/filtering over visible URLs and activity names.
But we should explore this further; maybe there is some lazy filtering/searching we should support built in.
Reference from @pacoid: https://eng.lyft.com/amundsen-lyfts-data-discovery-metadata-engine-62d27254fbb9 It would be good to think about how to support this kind of data registry.
Right now there is a "show"/"hide" button to deal with nested datasets. Instead this should probably be an arrow on the right side.
In order to support the previous integration we had with nteract's data explorer, we should be able to output a table mimetype from a notebook and view it with their data explorer by first registering it in ours. So we should try out enabling notebook outputs as nested datasets.
I think they will be of the form file://filename.ipynb#cells/123.
We should look into cleaning up notebook contexts once we don't use them anymore
None of the components support the dark theme. We need to use the JupyterLab CSS variables or the theme signal to track which theme we are on and color things appropriately.
The data explorer UI will be implemented as a set of react components. This is a placeholder for discussions related to that part of the work.
@ellisonbg mentioned that it would be good to support some default tabular data formats, to convert between them.
For each of these, we should define a data type, and define converters between them. Then we should make sure they work on some test datasets.
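As a sketch of one such converter, here is a naive CSV-to-table-schema conversion (the splitting below does not handle quoted fields, and the wiring into the registry's converter system is left out):

// Convert CSV text into the shape of "application/vnd.dataresource+json",
// the table schema MIME type that nteract's data explorer consumes.
interface DataResource {
  schema: { fields: Array<{ name: string; type: string }> };
  data: Array<Record<string, string>>;
}

function csvToDataResource(csv: string): DataResource {
  const [header, ...rows] = csv.trim().split("\n");
  const fields = header.split(",").map((name) => ({ name, type: "string" }));
  const data = rows.map((row) => {
    const cells = row.split(",");
    const record: Record<string, string> = {};
    fields.forEach((f, i) => {
      record[f.name] = cells[i];
    });
    return record;
  });
  return { schema: { fields }, data };
}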
Some pipelines that should work after this:
If a vega spec references file:///notebooks/Table.ipynb#/cells/4/outputs/0/data/application/vnd.dataresource+json, then this should use the pandas output from that cell in the notebook as an input to the vega spec. Depends on #20
I'm looking to make a list of common MIME types we'll need icons for.
Do we have any need to represent data or video files at this point? Do any of those in the list not make sense to include for now? Are there any obvious types missing?
I don't have a good feel for what MIME types will be considered 'common' in this use case, so please help me populate this list.