jupyterlab / jupyterlab-data-explorer
First class datasets in JupyterLab
License: BSD 3-Clause "New" or "Revised" License
Currently, we have to rely on manually parsing and creating different types of URLs which represent different dataset locations, such as the notebook cell URL. It would be nice if we could just write a format string for that URL and get a way to both generate notebook URLs and extract the data from them. Currently, we have to do something like this instead:
jupyterlab-data-explorer/dataregistry-extension/src/notebooks.ts
Lines 203 to 213 in a39a8af
This is error-prone and requires duplicating code.
Luckily, there is a standard just for this use case: URI Templates (RFC 6570)!
We should add support for this, either using an existing URI template library or writing our own. Ones that look like they might work are:
This is similar to how we created a type-safe abstraction over different mimetypes, some of them with arguments:
jupyterlab-data-explorer/dataregistry/src/datatypes.ts
Lines 42 to 66 in a39a8af
It lets us define a mimetype once, like this:
jupyterlab-data-explorer/dataregistry-extension/src/notebooks.ts
Lines 65 to 67 in a39a8af
and use it in converters to go to/from that mimetype:
jupyterlab-data-explorer/dataregistry-extension/src/notebooks.ts
Lines 75 to 97 in a39a8af
In a similar fashion, we should be able to create an object that refers to a certain URL template once, and then use it in converters. So we could add optional fromURL and toURL parameters to createConverter that take a URL template, so that instead of getting/returning an actual URL, you just get/return the parameters extracted from the template.
So the URLTemplate type that you pass in would have to hold both the string of the URL template and some types that specify the mapping from params to types, so probably an object. So possibly something like this:
const notebookTemplate = new TemplateURL<"path" | "cellID">(
  'file://{/path}.ipynb#/cells/{cellID}',
)
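As a very rough sketch of what such a TemplateURL helper might do (this class does not exist in the codebase, and a real implementation would follow RFC 6570 operator semantics and escape the literal parts of the template instead of the naive string handling below):

class TemplateURL<K extends string> {
  constructor(readonly template: string) {}

  // Fill the template parameters in, e.g. { path: "/a/b", cellID: "2" }.
  // A real implementation would also percent-encode values per RFC 6570.
  expand(params: Record<K, string>): string {
    return this.template.replace(/\{\/?([^}]+)\}/g, (_match, name: string) =>
      params[name as K]
    );
  }

  // Extract the parameters back out of a concrete URL, or return null if the
  // URL does not match the template.
  extract(url: string): Record<K, string> | null {
    const names: string[] = [];
    const pattern = this.template.replace(/\{\/?([^}]+)\}/g, (_match, name: string) => {
      names.push(name);
      return "(.+?)";
    });
    const match = url.match(new RegExp(`^${pattern}$`));
    if (!match) {
      return null;
    }
    const result: Record<string, string> = {};
    names.forEach((name, i) => {
      result[name] = match[i + 1];
    });
    return result as Record<K, string>;
  }
}

With the notebookTemplate above, extract("file:///a/b.ipynb#/cells/2") would give { path: "/a/b", cellID: "2" }, and expand would go the other way.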
It should be possible for extension authors to write a data converter to a widget and have this automatically registered as a mimerenderer.
If we output a vega spec in a notebook that refers to another cell, the URL used to reference it will be relative to the notebook containing the vega spec, like ./other-data. When we render the vega spec, we should take its current URL and resolve the data URLs in it relative to that.
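For plain relative paths, the standard WHATWG URL constructor already does this kind of resolution; for example (the notebook path here is made up, and references to other cells via fragments would need extra handling):

// Resolve a relative data URL against the URL of the spec that referenced it.
const specURL = "file:///notebooks/analysis.ipynb#/cells/2";
const dataURL = new URL("./other-data", specURL).href;
// dataURL === "file:///notebooks/other-data"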
We have added a preliminary debugger to see the state of the data registry. We should add a way to see the value of a piece of data using https://github.com/storybookjs/react-inspector and introspecting the observable
This is a standing issue to prototype, discuss, and review the user interface and user experience design of the data explorer.
I wanted to write down some thought on possible ways Quilt Data could integrate with the data registry, to let JupyterLab users explore Quilt Data more easily.
The basic entity is a package, like this: https://quiltdata.com/package/uciml/iris. It is in the form <user>/<name>. As a URL, it could look like quilt://quiltdata.com/uciml/iris. So if a user could add a dataset with that, what would they want to do with it?
If we create a vega spec with a data registry URL in it, the vega embed should resolve this using the data registry.
We should add output metadata to the data registry
We should be able to name the outputs of cells in a notebook, so that we can refer to them by their name. For example, we should be able to output a dataframe with a name, then refer to it by that name in a vega output later on.
Probably depends on #16 first
We should extend our existing notebook example to also show how this could work with a Python API that saves the dataframe as a file and exports that URL.
To reproduce:
@jupyterlab/metadata-extension
@tonyfast brought up an idea today about how to view metadata about the cell you are looking at. What I understand is that a user would run an output in a cell that changes the output metadata to put some JSON-LD in there about the cell. Then, they bring up the linked data browser (jupyterlab/jupyterlab-metadata-service#27 (comment)) and as they scroll through their cells they see the metadata about that cell.
One possible way to achieve this:
Try out separating converters and types into different files
xref: jupyterlab-contrib/jupyterlab-variableInspector#95
The Variable Inspector extension already has the capability to view different types of variables such as arrays and dataframes. So as not to duplicate effort, it would be useful if that extension could leverage the capabilities provided here - e.g. it could have a "View in Data Explorer" context menu item which could be hardwired to the double-click action.
We had an in-person meeting with contributors from NYU and Project Jupyter to define the User Needs for the Data Explorer project. Please keep in mind, it's still a work in progress, and I'll be updating this thread as we have more finalized, but I wanted to open this up for wider input in the meantime. The working document is available at: Dataset Explorer: User Needs Analysis
Users who will often use the Data Explorer:
Joe Data Scientist (PI) - Has domain expertise, writes a lot of code.
Jane Data Set Administrator - Works with her developers to register correct datasets along with relevant metadata/comments/code snippets.
Mike Business Intelligence Analyst - Does similar tasks to the Data Scientist but doesn't write as much code.
This issue provides an overview of the roadmap of the data explorer.
The JupyterLab data registry will enable extensions to 1) register abstract datasets with a central service and 2) monitor the registry for datasets. The dataset abstraction in the data registry includes:
The data registry also includes a converter architecture that can convert datasets from one MIME type to another.
Conceptually, the data registry will make datasets a first class entity or noun in JupyterLab.
The Data Explorer is a proposed user interface to enable users to explore datasets that different extensions have registered with the registry, and then do interesting things with those datasets, such as:
Conceptually, the data explorer UI will provide a user interface for the verbs related to a dataset, or the actions and activities a user can perform with the dataset, such as "render this as a table".
The visual representation of the list of datasets, and the things you can do with them is still a core design question.
If we create a notebook model context first using the data registry, and then open the notebook, this seems to fail to pick up the kernel.
We should add a reference to https://github.com/telamonian/jupyterlab-hdf in the readme
@hoo761 has been doing some work tracking when cells are moved around in https://github.com/jupyterlab/jupyterlab-commenting. Ideally, if you are looking at the output for a cell, and someone moves that cell, then your view should be renamed as well to reflect the new cell number.
To do this, we need to add a concept of renaming to the data registry, to register when one URL should be moved to another and make sure the views handle this properly.
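One possible shape for this, sketched with an rxjs stream of rename events that views could subscribe to (none of these names exist in the registry today; they only illustrate the idea):

import { Subject } from "rxjs";

// A rename maps an old dataset URL to its new URL.
interface RenameEvent {
  from: string;
  to: string;
}

const renames$ = new Subject<RenameEvent>();

// A view tracking a dataset URL updates itself when that URL is renamed.
function trackURL(initial: string, onChange: (url: string) => void): void {
  let current = initial;
  renames$.subscribe(({ from, to }) => {
    if (from === current) {
      current = to;
      onChange(to);
    }
  });
}

// e.g. when jupyterlab-commenting detects that cell 3 moved to position 5:
// renames$.next({ from: "file://a.ipynb#/cells/3", to: "file://a.ipynb#/cells/5" });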
Currently if you reload when the debugger is open, it isn't always restored.
Currently, re-running a cell with an output will give an error.
We should rewrite the voyager extension to use the conversion system. This will simplify its data ingestion logic.
The two packages here are ready for an initial release on NPM. I believe @ellisonbg needs to do the honors, since I am not part of the jupyterlab npm org.
Once we close a view, the widget is disposed and we cannot re-add it. We should change how the widget views work so that they are either not disposed when they close, or another is created.
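A sketch of the factory-based alternative (creating a fresh widget each time instead of caching one that gets disposed on close); the names and the Lumino import are illustrative only, not the extension's actual wiring:

import { Widget } from "@lumino/widgets";

// Keep a factory per dataset URL rather than a single widget instance.
const factories = new Map<string, () => Widget>();

function openDataset(url: string): Widget {
  const make = factories.get(url);
  if (!make) {
    throw new Error(`No widget factory registered for ${url}`);
  }
  // A new, not-yet-disposed widget every time the view is opened.
  return make();
}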
Our current model for the data registry is a global set of datasets that anyone can add datasets to and query for all the available datasets (pretty much what is in this issue #3).
There are a few new ideas floating around that would potentially change this fundamental layout:
First, I wanted to lay out my current conceptual models to let them guide this change in API:
file:///a/ is a "parent" of file:///a/b.csv. Also, both file:///a/ and file:///b/ share the same protocol of "file". We could make use of these semantics in the UI, to turn a flat list of URLs into a nested interactive tree view of datasets. Separately, a dataset could expose a MIME type (something like xxx.jupyter.datasets) that returns a list of "children" datasets.
These conceptual models could be conflicting. Do you want the "tree view" of datasets to be determined by URL or by child/parent status? Basically, I am asking if the child of a dataset has to be related to its parent as a subpath in the URL.
Let's take a step back from these conceptual mind games and make a list of different use cases we are trying to target with these concepts:
If you have a datasets.yml file that contains references to other datasets, you should be able to explore all of these in the data explorer.
Already in these use cases, I have changed the fundamental framing of how datasets end up in the data explorer. Instead of having an imperative style "register/publish" function to add a dataset, there are different "providers" (we can call them) that users can explore and find datasets within. It is moving from a push (user adds datasets explicitly) to a pull (data registry queries provider to see what datasets are available) model.
This is nice, because then we move the state management into these providers. They can figure out when/how to deregister datasets and can implement whatever algorithms they want to list all the available datasets.
A few questions remain:
To answer these, I think it would be helpful to sketch out some different possible user experiences for the data registry with nested data / data providers that cover the use cases we care about. This will inform what API makes sense to require from a provider and how providers relate to URLs. For example, if we have a nested structure that allows users to fold/unfold at parts of the URL path, then the provider should have something like queryChildren(basePath: string): Datasets. If we need pagination, then this response should also be paginated.
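A hypothetical provider interface along those lines might look like this (Dataset, DatasetProvider, and queryChildren are made-up names, not existing API):

// What the registry needs to know about a dataset a provider exposes.
interface Dataset {
  url: string;
  mimeTypes: string[];
}

// A provider the registry can pull from instead of being pushed into.
interface DatasetProvider {
  // List the datasets directly "under" a base URL, e.g. "file:///a/".
  queryChildren(basePath: string): Promise<Dataset[]>;
}

// A trivial provider, e.g. backed by a parsed datasets.yml file.
class StaticProvider implements DatasetProvider {
  constructor(private datasets: Dataset[]) {}

  async queryChildren(basePath: string): Promise<Dataset[]> {
    return this.datasets.filter((d) => d.url.startsWith(basePath));
  }
}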
From a notebook we should be able to output a link to a local file path and have that show up in the data registry.
This is useful if you want to save a dataframe as a CSV file then load it on the frontend.
Before the first release, we should do some documentation. It should focus on three possible users:
Currently I am thinking this can just go in markdown.
Intake is a "lightweight package for finding, investigating, loading and disseminating data." It would be nice to figure out how the JupyterLab data registry could integrate with this package.
Having JupyterLab be aware of Intake's "Data catalogs" is probably a good place to start. They "provide an abstraction that allows you to externally define, and optionally share, descriptions of datasets, called catalog entries."
For example, if you have a catalog as a file on disk in a catalog.yaml file, we might want to be able to see the datasets it defines in the data registry. This is similar to how, currently, if you have a .ipynb file, you can view the datasets in its cell outputs. To do this, we would have to be able to parse its YAML format in JavaScript and map the different entries to URLs.
For example, this catalog.yml file:
metadata:
  version: 1
sources:
  example:
    description: test
    driver: random
    args: {}

  entry1_full:
    description: entry1 full
    metadata:
      foo: 'bar'
      bar: [1, 2, 3]
    driver: csv
    args: # passed to the open() method
      urlpath: '{{ CATALOG_DIR }}/entry1_*.csv'

  entry1_part:
    description: entry1 part
    parameters: # User parameters
      part:
        description: section of the data
        type: str
        default: "stable"
        allowed: ["latest", "stable"]
    driver: csv
    args:
      urlpath: '{{ CATALOG_DIR }}/entry1_{{ part }}.csv'
Might map to a number of nested URLs:
./dataset.yml#/sources/example
./dataset.yml#/sources/entry1_full
./dataset.yml#/sources/entry1_part
And the ones that point to CSV files would also point to some nested URLs; for example, ./dataset.yml#/sources/entry1_part would point to:
./entry1_latest.csv
./entry1_stable.csv
This basically requires re-implementing the logic of all the drivers, so that they can work client-side.
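A rough sketch of just the catalog-to-URL step described above (leaving all driver logic out), assuming the js-yaml package is available on the frontend:

import { load } from "js-yaml";

interface IntakeCatalog {
  sources?: Record<string, unknown>;
}

// Map each source in an Intake catalog to a nested URL of the form
// <catalogURL>#/sources/<name>, e.g. ./catalog.yml#/sources/entry1_full.
function catalogEntryURLs(yamlText: string, catalogURL: string): string[] {
  const catalog = load(yamlText) as IntakeCatalog;
  return Object.keys(catalog.sources ?? {}).map(
    (name) => `${catalogURL}#/sources/${name}`
  );
}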
We could also support loading a remote Intake data catalog. If you loaded a URL like intake://catalog1:5000 in the data registry, you would want to be able to see the datasets available. Here, the proxy mode might be useful:
Proxied access: In this mode, the catalog server uses its local drivers to open the data source and stream the data over the network to the client. The client does not need any special drivers to read the data, and can read data from files and data servers that it cannot access, as long as the catalog server has the required access.
If we implement a client API for this server protocol, then we can let it handle all the data parsing and just expose the results it returns to the user. We would have to look more in depth at its specification.
I'm looking into whether I can build on this extension in order to implement a long-held ambition: an HDF5 file viewer for Jlab.
An HDF5 file is kind of like its own mini filesystem: there's a tree of groups (equivalent to directories), and each group may contain datasets and/or other groups. My basic idea would be to expose the group tree in the dataset browser, and then be able to open/view a given dataset (assuming it's 2D or less) as a grid in the main area.
Along those lines, I have a bunch of questions:
dataregistry-extension/src/files.tsx. Would anything else be required?
This is an issue to track the initial repository setup:
Drag photos into comments on this issue to upload them and get a URL. Then you can link to them from the docs/readme
@tonyfast mentioned it would be useful to register a GraphQL viewer for GraphQL files.
At the meeting today @ellisonbg and @tonyfast mentioned that it might be nice to be able to actually see the graph of the mimetypes that is generated during the conversion process.
To do this, we would need to save some extra context.
We should investigate filtering and searching, both how these things should be done on the data registry level and how they should be displayed in the UX.
We need to be careful how specific we get with these interfaces, to both support a wide set of use cases and keep the UI manageable.
@ellisonbg mentioned we might want to provide some built-in filtering/searching just for in-memory data, like possibly everything we show in the UX. And then more advanced filtering/searching could be implemented in an alternative UI if that is desired.
So in the default UI, maybe just show "activities" and nesting, and allow searching/filtering over visible URLs and activity names.
But we should explore this further; maybe there is some lazy filtering/searching we should support built in.
Reference from @pacoid: https://eng.lyft.com/amundsen-lyfts-data-discovery-metadata-engine-62d27254fbb9 It would be good to think about how to support this kind of data registry.
Right now there is a "show"/"hide" button to deal with nested datasets. Instead this should probably be an arrow on the right side.
In order to support the previous integration we had with nteract's data explorer, we should be able to output a table mimetype from a notebook and view it with their data explorer by first registering it in ours. So we should try out enabling notebook outputs as nested datasets.
I think they will be of the form file://filename.ipynb#cells/123.
We should look into cleaning up notebook contexts once we don't use them anymore
None of the components support the dark theme. We need to use the JupyterLab CSS variables or the theme signal to track which theme we are on and color things appropriately.
The data explorer UI will be implemented as a set of react components. This is a placeholder for discussions related to that part of the work.
@ellisonbg mentioned that it would be good to support some default tabular data formats, to convert between them.
For each of these, we should define a data type, and define converters between them. Then we should make sure they work on some test datasets.
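As a sketch of one such converter, here is a naive CSV-to-table-schema conversion (the splitting below does not handle quoted fields, and the wiring into the registry's converter system is left out):

// Convert CSV text into the shape of "application/vnd.dataresource+json",
// the table schema MIME type that nteract's data explorer consumes.
interface DataResource {
  schema: { fields: Array<{ name: string; type: string }> };
  data: Array<Record<string, string>>;
}

function csvToDataResource(csv: string): DataResource {
  const [header, ...rows] = csv.trim().split("\n");
  const fields = header.split(",").map((name) => ({ name, type: "string" }));
  const data = rows.map((row) => {
    const cells = row.split(",");
    const record: Record<string, string> = {};
    fields.forEach((f, i) => {
      record[f.name] = cells[i];
    });
    return record;
  });
  return { schema: { fields }, data };
}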
Some pipelines that should work after this:
If a vega spec references file:///notebooks/Table.ipynb#/cells/4/outputs/0/data/application/vnd.dataresource+json, then this should use the pandas output from that cell in the notebook as an input to the vega spec. Depends on #20
I'm looking to make a list of common MIME types we'll need icons for.
Do we have any need to represent data or video files at this point? Do any of those in the list not make sense to include for now? Are there any obvious types missing?
I don't have a good feel for what MIME types will be considered 'common' in this use case, so please help me populate this list.