Storing tables about ngff HOT 16 OPEN

ome avatar ome commented on July 23, 2024
Storing tables

from ngff.

Comments (16)

unidesigner avatar unidesigner commented on July 23, 2024 4

An R-Tree is probably what you are looking for. Not sure if this can be serialized in a way where it is not necessary to do a full-table scan to find out about the relevant rows.

The pandas docs also have some interesting links to out-of-memory data formats and libraries, in particular on the ecosystem page. It's not only about fetching data for visualization purposes, but also about efficient compute.

oeway avatar oeway commented on July 23, 2024 3

Hey,

@joshmoore Would it make sense to consider the spatial partitioning information as some sort of annotation to the x,y,z columns?

@kevinyamauchi Good to see you here too! Yes, I think it does! It will certainly be useful for the use case I am targeting (i.e. SMLM data). I can also see it being super useful for storing massive scatter plots, e.g. generated from scRNA-seq.

I just did a quick read of your existing PR. In practice, if we do want to support octrees (the structure most commonly used for displaying LiDAR sensor data, and one that has been proven to work with even trillions of points for browser-based visualization), would it mean we just add additional tables to var? I would be happy to work with any of you to make a data loader to bridge with the potree viewer (the one I am using now).

ivirshup avatar ivirshup commented on July 23, 2024 2

> Would it make sense to consider the spatial partitioning information as some sort of annotation to the x,y,z columns?

I think it might make sense to consider spatial indexing as a property of the coordinate array. Especially if you have the same points represented in different coordinate spaces (e.g. slide by itself, slide aligned in stack).

unidesigner avatar unidesigner commented on July 23, 2024 1

Just out of curiosity, what is the reason for wanting to store tabular data in zarr vs. using an existing, optimized data format such as Avro, Parquet, or SQLite?

constantinpape avatar constantinpape commented on July 23, 2024 1

> My worry was that AnnData is much richer than a simple table

I don't think that AnnData is much richer than a simple table; at least not the subset that we are discussing here. But we have the major advantage that the dtype for each column is known ...

> Do you think that one could also display AnnData in a "simple table viewer"?

Sure. Load X into a 2d array, load obs into a 2d array (this works in Python, where complex dtypes are easy; I don't know how you would do this in Java, but you have the same problem with CSV), and concatenate the two along the column axis. This gives you a simple table. (The only question is what to do about var, but for simplicity it could just be ignored.)
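
As a rough illustration of the flattening described above, here is a minimal sketch in plain Python. The names `X`, `var_names`, `obs` and `flatten`, and all sample values, are made up for illustration; a real AnnData object would hold NumPy arrays and a pandas DataFrame rather than lists and dicts. Only the column names from var are used, and the rest of var is ignored as suggested.

```python
# Hypothetical stand-ins for AnnData fields (plain lists instead of arrays).
X = [[0.1, 2.0], [0.4, 1.5], [0.9, 0.3]]  # n_obs x n_var measurements
var_names = ["gene_a", "gene_b"]           # one name per column of X
obs = {"label": [1, 2, 3], "cell_type": ["T", "B", "T"]}  # per-row annotations


def flatten(X, var_names, obs):
    """Join the obs columns and the X columns side by side."""
    header = list(obs) + list(var_names)
    rows = []
    for i, x_row in enumerate(X):
        obs_row = [column[i] for column in obs.values()]
        rows.append(obs_row + list(x_row))
    return header, rows


header, rows = flatten(X, var_names, obs)
print(header)
for row in rows:
    print(row)
```

The result is a header plus a list of rows, which is the shape a generic table viewer would typically expect.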

oeway avatar oeway commented on July 23, 2024 1

I have been testing octree-based spatial partitioning of point clouds (using a library called potree) for the shareloc.xyz platform. It allows us to visualise large point clouds instantly (instead of downloading everything).

Here is a demo for visualising single-molecule localisation microscopy data:
https://imodpasteur.github.io/shareloc-utils/shareloc-potree-viewer.html?pointShape=circle&pointSizeType=adaptive&name=FFB000&load=https://imjoy-s3.pasteur.fr/public/pointclouds/7312e0.zip

When you open it and zoom in, more point chunks are loaded into the browser.

The tree is stored in a zip file and I used HTTP Range requests to obtain the chunks.
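
For readers unfamiliar with the mechanism, the following is a minimal sketch of how a byte range might be requested with Python's standard library. The URL, the offsets, and the helper name `range_request` are placeholders; in practice the offset and length of each chunk would come from the zip file's central directory, and this is not necessarily how the shareloc viewer does it.

```python
import urllib.request


def range_request(url, start, length):
    """Build a GET request for `length` bytes starting at byte `start`."""
    end = start + length - 1  # the Range header uses an inclusive end offset
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})


# Hypothetical URL and offsets, for illustration only.
req = range_request("https://example.org/pointcloud.zip", start=1024, length=512)
print(req.get_header("Range"))

# A server that supports ranges answers with status 206 and only those bytes:
# with urllib.request.urlopen(req) as resp:
#     chunk = resp.read()
```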

As I understand, the tabular support we are discussing here won't allow storing point chunks organized in a tree yet, am I right?

cc @joshmoore

joshmoore avatar joshmoore commented on July 23, 2024 1

Also cc: @kevinyamauchi and @ivirshup, who are also discussing this further this week.

I think you are right that there's no tree representation in the current discussions, but perhaps it's more a matter of AND rather than OR. That is, my understanding of the benefit of a tabular layout is the ability to add annotations to the data. How would that work in the tree representation? Does one need both?

For those interested in taking a look, here are some brief details on the contents of @oeway's zip:

unzipped 7312e0.zip
cat sources.json | jq .
{
  "bounds": {
    "min": [
      1600.013671875,
      1633.791748046875,
      0
    ],
    "max": [
      41039.4140625,
      40804.4296875,
      0
    ]
  },
  "projection": "",
  "sources": [
    {
      "name": ".tmp.txt",
      "points": 16898373,
      "bounds": {
        "min": [
          1600.013671875,
          1633.791748046875,
          0
        ],
        "max": [
          41039.4140625,
          40804.4296875,
          0
        ]
      }
    }
  ]
}

cat cloud.js | jq .
{
  "version": "1.7",
  "octreeDir": "data",
  "projection": "",
  "points": 16898373,
  "boundingBox": {
    "lx": 1600.013671875,
    "ly": 1633.791748046875,
    "lz": 0,
    "ux": 41039.4140625,
    "uy": 41073.192138671875,
    "uz": 39439.400390625
  },
  "tightBoundingBox": {
    "lx": 1600.013671875,
    "ly": 1633.791748046875,
    "lz": 0,
    "ux": 41039.4140625,
    "uy": 40804.4296875,
    "uz": 0
  },
  "pointAttributes": [
    "POSITION_CARTESIAN",
    "COLOR_PACKED"
  ],
  "spacing": 341.55523681640625,
  "scale": 0.001,
  "hierarchyStepSize": 5
}
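
One detail that may be worth noting in the output above: `boundingBox` appears to be `tightBoundingBox` grown into a cube whose side is the largest extent (here the x extent), presumably so that every octree node can be split into eight equal children. The sketch below reproduces the numbers from `cloud.js`; the function name `cubic_bounds` is made up, and the rule is inferred from these values rather than taken from the potree documentation.

```python
def cubic_bounds(lower, upper):
    """Grow an axis-aligned box into a cube anchored at its lower corner."""
    side = max(u - l for l, u in zip(lower, upper))
    return list(lower), [l + side for l in lower]


# Values copied from "tightBoundingBox" in cloud.js.
tight_lower = [1600.013671875, 1633.791748046875, 0.0]
tight_upper = [41039.4140625, 40804.4296875, 0.0]

lower, upper = cubic_bounds(tight_lower, tight_upper)
print(lower)
print(upper)
```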

tree data/ | head
data/
└── r
    ├── 00060
    │   ├── r00060.bin
    │   └── r00060.hrc
    ├── 00062
    │   ├── r00062.bin
    │   └── r00062.hrc
    ├── 00064
    │   ├── r00064.bin

tree data/ | tail
    ├── r6642.bin
    ├── r6644.bin
    ├── r6646.bin
    ├── r666.bin
    ├── r6660.bin
    ├── r6662.bin
    ├── r6664.bin
    └── r6666.bin

760 directories, 3368 files
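
The listing suggests a naming scheme in which each digit after `r` selects a child node, and complete groups of `hierarchyStepSize` (5) digits become sub-directories. The following is a small sketch of that mapping, inferred only from the paths shown above; the function name `node_bin_path` is hypothetical, and the real potree loader may differ in details.

```python
def node_bin_path(name, step=5, octree_dir="data"):
    """Guess the .bin path of an octree node such as 'r00060' or 'r6642'."""
    digits = name[1:]           # child indices below the root node 'r'
    full = len(digits) // step  # number of complete groups of `step` digits
    groups = [digits[i * step:(i + 1) * step] for i in range(full)]
    return "/".join([octree_dir, "r", *groups, name + ".bin"])


print(node_bin_path("r00060"))
print(node_bin_path("r6642"))
```

With this scheme a viewer can compute the location of any node from its name alone, without listing the archive.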

imagesc-bot avatar imagesc-bot commented on July 23, 2024

This issue has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/ome-ngff-spatial-omics-hackathon/57337/28

joshmoore avatar joshmoore commented on July 23, 2024

toCSV will be easy enough. fromCSV will for many (if not most?) cases require extra metadata. There are a number of attempts to provide such metadata, e.g., https://specs.frictionlessdata.io/data-package/
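
To make the asymmetry concrete, here is a minimal sketch using only Python's `csv` module. The column names, the sample rows, and the `schema` mapping are invented for illustration; the schema stands in for whatever extra metadata a real specification would provide.

```python
import csv
import io

rows = [{"label": 1, "area": 12.5, "is_cell": True},
        {"label": 2, "area": 7.25, "is_cell": False}]

# toCSV: straightforward, but every value becomes text.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["label", "area", "is_cell"])
writer.writeheader()
writer.writerows(rows)

# fromCSV without metadata: all values come back as strings.
raw = list(csv.DictReader(io.StringIO(buffer.getvalue())))

# fromCSV with a (hypothetical) schema: the original types can be restored.
schema = {"label": int, "area": float, "is_cell": lambda s: s == "True"}
typed = [{key: schema[key](value) for key, value in row.items()} for row in raw]

print(raw[0])
print(typed[0])
```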

tischi avatar tischi commented on July 23, 2024

I think the point was to be able to efficiently load chunks of the table from an object store. I don't have enough technological knowledge to judge if any of the existing solutions would do the trick. Do you know?

constantinpape avatar constantinpape commented on July 23, 2024

I wanted to ask whether we should consider that whatever we store can be easily mapped onto a csv file. Meaning that fromCSV and toCSV should work smoothly such that other software that can work with tables can be interoperable with the tabular content of the ome-zarr.

I think that compatibility with csv is desirable, but I am not sure how much can be done about this on the spec level.
I see this as more of a software than data standard question.

> Just out of curiosity, what is the reason of wanting to store tabular data in zarr,

I would say the main reason is to provide all relevant data in the same data format and container.
Also note that AnnData, which the proposal is based on, is using zarr as storage already.

tischi avatar tischi commented on July 23, 2024

> Also note that AnnData, which the proposal is based on, is using zarr as storage already.

My worry was that AnnData is much richer than a simple table and thus it may be difficult to map it onto a "simple table"? For example, both in Fiji and Napari there are ways to display a table. Do you think that one could also display AnnData in a "simple table viewer"?

unidesigner avatar unidesigner commented on July 23, 2024

> I think the point was to be able to efficiently load chunks of the table from an object store. I don't have enough technological knowledge to judge if any of the existing solutions would do the trick. Do you know?

I found this comparison of zarr and parquet interesting, especially the choice of zarr over parquet for its flexibility and the append option. https://waterdata.usgs.gov/blog/cloud_data/

I don't know all the formats in detail, but I imagine that it's not the first time this requirement comes up, and people have implemented solutions for this.

A question I'd have is about the index, i.e. how do you know what slice of data you want to fetch?

tischi avatar tischi commented on July 23, 2024

> A question I'd have is about the index, i.e. how do you know what slice of data you want to fetch?

Good question! Typically one row in the table would correspond to a specific region (e.g. a point) in the image.

One use-case is that if people look at an image region one wants to load all the rows that correspond to this image region, e.g. in order to render something on the image.

We were thinking that an efficient image-coordinate to table-row mapping could be done by a tree where you enter the coordinate and the leaves of the tree are the table-row indices. However, how to serialize a tree into ome-zarr is something that we have not looked into yet.
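
The idea of a tree whose leaves hold table-row indices could look roughly like the in-memory quadtree below. This is only a sketch under simple assumptions (2D points, a fixed leaf capacity, a depth limit to stop runaway splitting on duplicate points); the class name `QuadTree` and all parameters are made up, and it says nothing about how such a tree would be serialized into ome-zarr.

```python
class QuadTree:
    """Minimal 2D quadtree whose leaves store table-row indices."""

    def __init__(self, x0, y0, x1, y1, capacity=2, depth=0, max_depth=16):
        self.bounds = (x0, y0, x1, y1)
        self.capacity, self.depth, self.max_depth = capacity, depth, max_depth
        self.rows = []        # (x, y, row_index) entries held by a leaf
        self.children = None  # four sub-trees once the leaf has been split

    def insert(self, x, y, row):
        if self.children is not None:
            self._child_for(x, y).insert(x, y, row)
            return
        self.rows.append((x, y, row))
        if len(self.rows) > self.capacity and self.depth < self.max_depth:
            self._split()

    def _split(self):
        x0, y0, x1, y1 = self.bounds
        xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
        args = (self.capacity, self.depth + 1, self.max_depth)
        self.children = [QuadTree(x0, y0, xm, ym, *args),
                         QuadTree(xm, y0, x1, ym, *args),
                         QuadTree(x0, ym, xm, y1, *args),
                         QuadTree(xm, ym, x1, y1, *args)]
        entries, self.rows = self.rows, []
        for x, y, row in entries:
            self._child_for(x, y).insert(x, y, row)

    def _child_for(self, x, y):
        x0, y0, x1, y1 = self.bounds
        xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
        return self.children[(1 if x >= xm else 0) + (2 if y >= ym else 0)]

    def query(self, qx0, qy0, qx1, qy1):
        """Return the row indices of all points inside the query rectangle."""
        x0, y0, x1, y1 = self.bounds
        if qx1 < x0 or qx0 > x1 or qy1 < y0 or qy0 > y1:
            return []  # the query rectangle does not touch this node
        if self.children is None:
            return [row for x, y, row in self.rows
                    if qx0 <= x <= qx1 and qy0 <= y <= qy1]
        found = []
        for child in self.children:
            found.extend(child.query(qx0, qy0, qx1, qy1))
        return found


# Hypothetical table: row i of the table holds point i.
points = [(1.0, 1.0), (2.0, 8.0), (9.0, 9.0), (8.5, 1.5), (1.5, 1.2)]
tree = QuadTree(0, 0, 10, 10)
for row, (x, y) in enumerate(points):
    tree.insert(x, y, row)

print(sorted(tree.query(0, 0, 3, 3)))
```

A viewer would call `query` with the currently visible image region and then load only the returned rows from the table.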

imagesc-bot avatar imagesc-bot commented on July 23, 2024

This issue has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/next-generation-user-friendly-smlm-processing-software-aka-thunderstorm-2-0/62289/20

kevinyamauchi avatar kevinyamauchi commented on July 23, 2024

Hey @oeway ! Nice to see you here. Super cool that you're looking into rendering with spatial partitioning.

Indeed, we are currently focusing on storing points in a table and we are not planning to specify the format for spatial indices (for now). I think there are too many different strategies and the best one is likely application dependent, so I think it doesn't make sense to standardize that. I am definitely open to adding specs for some common spatial indices (e.g., octree, rtree) at some point once we have the basic table spec nailed down.

The pattern I would advocate for is that one queries the spatial index (e.g., octree) to look up the rows to fetch from the table for rendering. The table can be chunked along the rows, so this will allow points to be loaded lazily. Of course, the performance will depend on your chunking and the ordering of your table.
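
A small sketch of the second half of that pattern, assuming the spatial index has already returned a list of row indices and that the table is chunked along rows with a fixed chunk size. The function name `rows_to_chunks`, the chunk size, and the sample indices are all hypothetical.

```python
from collections import defaultdict


def rows_to_chunks(row_indices, chunk_size):
    """Group row indices by the row-chunk that stores them.

    Returns {chunk_id: [offsets within that chunk]}.
    """
    chunks = defaultdict(list)
    for row in sorted(row_indices):
        chunks[row // chunk_size].append(row % chunk_size)
    return dict(chunks)


# Rows returned by a (hypothetical) spatial index query.
hits = [3, 1002, 5, 1001, 2500]
print(rows_to_chunks(hits, chunk_size=1000))
```

Only the chunks that appear as keys need to be fetched. If the table is ordered so that spatially close points sit in nearby rows, a query tends to touch fewer chunks, which is why ordering matters for performance.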

What do you think @oeway , does this sound reasonable?
