Comments (8)

martindurant commented on September 28, 2024

Create new repo for this work

I have no preference where this lives: under jupyterlab, another related org, or Intake are all fine.

from jupyterlab-data-explorer.

saulshanabrook commented on September 28, 2024

I am chatting with @danielballan about this issue. We have come up with a plan!

Intake discovers catalogues in the system by looking at certain paths for .yml files. There is an open issue (intake/intake#404) to also discover catalogues in Python packages via an entry point intake.catalogues.
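For reference, here is a minimal sketch of what entry-point discovery could look like from Python, using only the standard library. The group name `intake.catalogues` is the one proposed in the open issue, so treat it as an assumption rather than a shipped Intake feature; `discover_catalogues` is a hypothetical helper name.

```python
from importlib.metadata import entry_points


def discover_catalogues(group="intake.catalogues"):
    """Return {entry point name: "module:attr" target} for the given group.

    The group name follows the proposal in intake/intake#404 and is an
    assumption. Nothing is imported or loaded here; we only list targets.
    """
    eps = entry_points()
    if hasattr(eps, "select"):
        # Newer Pythons expose a selectable collection of entry points.
        selected = eps.select(group=group)
    else:
        # Older Pythons (3.8/3.9) return a plain dict keyed by group name.
        selected = eps.get(group, [])
    return {ep.name: ep.value for ep in selected}


if __name__ == "__main__":
    for name, target in discover_catalogues().items():
        print(f"{name} -> {target}")
```

With no packages registering that group, this simply returns an empty dict.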

So what we can do is launch an Intake server as a Jupyter server proxy that serves up Intake's HTTP API. We can connect to this in a JupyterLab plugin and register a top level Intake dataset. The user should be able to see the catalogues within this dataset and expand them recursively. For datasources, users should be able to insert a snippet into their notebook that loads this datasource with intake, like import intake \n intake.cat.abcd.
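As a rough sketch of the server-proxy part: jupyter-server-proxy can pick up a server definition from a function that returns a dict describing the process, registered under the `jupyter_serverproxy_servers` entry point group. The function name `setup_intake_server` and the `catalog.yml` path below are placeholders I made up for illustration, and the exact `intake-server` arguments should be checked against the installed Intake version.

```python
def setup_intake_server():
    """Describe how to launch the Intake server for jupyter-server-proxy.

    Hypothetical helper: the name and the catalogue path are assumptions.
    """
    return {
        # "{port}" is substituted by jupyter-server-proxy at launch time.
        # "catalog.yml" is a placeholder path, not a real default.
        "command": ["intake-server", "--port", "{port}", "catalog.yml"],
        # Seconds to wait for the server to start answering requests.
        "timeout": 30,
        # Hide the launcher tile; the JupyterLab plugin talks to the API itself.
        "launcher_entry": {"enabled": False},
    }


if __name__ == "__main__":
    print(setup_intake_server()["command"])
```

If the entry point were named `intake`, the proxied API would presumably be reachable under the notebook server's base URL at `/intake/`, which is where the JupyterLab plugin would connect.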

On the client side this requires implementing an Intake client API in JavaScript, which will use MessagePack. It will also require writing a JupyterLab extension that registers data converters for these Intake URLs that hit the API.

We can then extend that, if we like, to actually request the contents of the data sources and display them in some way on the client. For example, we can display a numpy array in a datagrid. This will require writing custom logic for each intake driver to know how to request a chunk and parse the resulting data.

We can also display metadata provided by Intake about data sources, like their shape and dtype. We should register this with the metadata service so that the user can see metadata in the right-hand side pane as they navigate their catalogue. Intake also allows datasources to provide arbitrary metadata. If the driver returns this metadata in JSON-LD, we can also display that in the metadata explorer.


We also discussed letting users discover catalogues by finding their intake.yml files in the file system and expanding them, as well as showing the catalogues provided by different Python packages. That way, when users are exploring the data registry they see the source the catalogue came from, instead of seeing all catalogues flattened at the top level. Authors could also write datasets.yml files that collate these separate catalogues for a single repo. We decided against this approach for now, since Intake already has a discovery mechanism for merging all the catalogues available to users.

cc @martindurant @gwbischof

martindurant commented on September 28, 2024

Thanks for starting this discussion, I am actively thinking about it!

saulshanabrook commented on September 28, 2024

@ian-r-rose has a dcat dataset intake driver that exposes metadata, so we should also try that pipeline of getting metadata from a driver, into the data explorer, and then into the metadata explorer: https://twitter.com/IanRRose/status/1182660959413784576

martindurant commented on September 28, 2024

OK, I think I have got over my initial reservations: the frontend is much better off talking with a REST service than with a Python kernel, so we may as well use the Intake server. Serving the "builtin" items is something we want to allow anyway, rather than always exposing a given cat. It may be useful (but not necessary) to expose connections to other servers too, in which case instead of intake.cat.abc, you would need cat = intake.open_catalog("..."); cat.abc.
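To make the two cases concrete, here is a small sketch of how the notebook snippet could be generated for either the default (builtin) catalogue or a remote server. `load_snippet` is a hypothetical helper, and the `intake://example:5000` URL in the usage line is only an illustrative value.

```python
def load_snippet(entry, server_url=None):
    """Build the code a user would paste into a notebook to load an entry.

    Hypothetical helper: with no server_url it targets the builtin
    catalogue (intake.cat), otherwise it opens the remote catalogue first.
    """
    lines = ["import intake"]
    if server_url is None:
        lines.append(f"intake.cat.{entry}")
    else:
        # repr() quotes the URL so the generated code is valid Python.
        lines.append(f"cat = intake.open_catalog({server_url!r})")
        lines.append(f"cat.{entry}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(load_snippet("abc"))
    print(load_snippet("abc", "intake://example:5000"))
```

This assumes entry names are valid Python identifiers; names that are not would need item access such as `cat["name"]` instead of attribute access.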

The server likes to talk msgpack, rather than JSON, I hope that everything translates to the JS side. I suppose, if the metadata can be displayed in something like YAML blocks (i.e., as they would be in the catalog), that's enough.

So what needs to happen to make progress here?

saulshanabrook commented on September 28, 2024

It may be useful (but not necessary) to expose connections to other servers too, in which case instead of intake.cat.abc, you would need cat = intake.open_catalog("..."); cat.abc.

I agree. I think this would be good to allow after initial work exposing the default server.

The server likes to talk msgpack, rather than JSON, I hope that everything translates to the JS side.

There is a msgpack client for JavaScript, so this should be fine.

I suppose, if the metadata can be displayed in something like YAML blocks (i.e., as they would be in the catalog), that's enough.

Is there a standard for the metadata a driver exposes? Or is it up to them to expose whatever they want? If it is JSON-LD we could expose it in the metadata service. Otherwise, we could expose it however we like with whatever UI makes sense for it.

So what needs to happen to make progress here?

  • Create new repo for this work
  • Create Python package that exposes intake API through jupyter server proxy
  • Create JS package that exposes intake API in JS
  • Create JS package that exposes JupyterLab extension which connects to intake API with JS API package and adds this to the data registry.

martindurant commented on September 28, 2024

Is there a standard for the metadata a driver exposes?

There are standard things that every entry has (name, description, driver, arguments), but the general metadata is totally arbitrary.
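A rough sketch of how a client could handle that split, assuming an entry arrives as a plain dict with the standard fields at the top level and the free-form part under a `metadata` key. That layout is an assumption, not the documented server payload, and both helper names are hypothetical. The only JSON-LD detail relied on is that JSON-LD documents normally carry an `@context` key.

```python
STANDARD_FIELDS = ("name", "description", "driver", "arguments")


def split_entry(entry):
    """Separate the standard fields from the arbitrary metadata.

    Assumes `entry` is a dict with the free-form part under "metadata".
    Missing standard fields come back as None.
    """
    standard = {key: entry.get(key) for key in STANDARD_FIELDS}
    metadata = dict(entry.get("metadata") or {})
    return standard, metadata


def looks_like_json_ld(metadata):
    """Heuristic: treat metadata with an "@context" key as JSON-LD."""
    return "@context" in metadata


if __name__ == "__main__":
    example = {
        "name": "abc",
        "description": "An example entry",
        "driver": "csv",
        "arguments": {"urlpath": "data.csv"},
        "metadata": {"@context": "https://schema.org", "@type": "Dataset"},
    }
    standard, metadata = split_entry(example)
    print(standard["driver"], looks_like_json_ld(metadata))
```

Metadata that passes the check could be handed to the metadata service, and anything else shown as a generic block.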

martindurant commented on September 28, 2024

@saulshanabrook - this dropped off the table at some point. Are you still interested?
