
Comments (2)

jorisvandenbossche commented on June 12, 2024

Some advantages/disadvantages I can think of for the different options for specifying this:

{"encoding": "geoarrow.point", ...}

Pro is that the encoding value fully describes the geoarrow type. But a disadvantage is that this adds a whole series of possible values for the "encoding" key. This makes handling of this key a bit more complex (although in Python terms it would just be col["encoding"].startswith("geoarrow") instead of col["encoding"] == "geoarrow").
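To make that concrete, a reader-side check under this first option could look roughly like the sketch below. The helper name `is_geoarrow_encoded` and the two sample metadata dicts are hypothetical; only the "encoding" key and its values come from the proposal. The sketch matches on the prefix "geoarrow." (with the dot), which is a slightly stricter variant of the check mentioned above.

```python
def is_geoarrow_encoded(col: dict) -> bool:
    """Return True if the column metadata declares a geoarrow.* encoding."""
    # Under option 1 the encoding value is the full type name
    # (e.g. "geoarrow.point"), so a prefix check replaces a plain equality check.
    return col.get("encoding", "").startswith("geoarrow.")


# Hypothetical column metadata entries, for illustration only.
wkb_col = {"encoding": "WKB"}
point_col = {"encoding": "geoarrow.point"}

print(is_geoarrow_encoded(wkb_col), is_geoarrow_encoded(point_col))
```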

{"encoding": "geoarrow", "extension_name": "geoarrow.point"}

Pro is that this adds only a single new "encoding" value. But then you also still need to check the value of the other key to get the actual type.
If we go with this, I would rather use a different key than "extension_name". "Extension" is a rather Arrow-specific term, and while the encoding itself is also called "geoarrow", it can still be implemented by Parquet implementations or systems that don't have anything to do with Arrow. We could also use a more generic "geoarrow_type"?

{"encoding": "geoarrow", ..., "geometry_types": ["Point"]}

Similar advantage of only adding a single "encoding" value, and additional advantage of not having to add a custom key that is only needed for geoarrow encoded data like above. But clear disadvantage is that you need to transform and combine the two keys manually to get the actual geoarrow type name.
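A rough sketch of what that "transform and combine" step could look like for a reader is given here. The function name `geoarrow_type_from_metadata` is hypothetical, and the sketch assumes that "geometry_types" holds exactly one plain type name (no dimension suffix) and that the geoarrow type name is simply the lowercased geometry type with a "geoarrow." prefix, as in the "geoarrow.point" example above.

```python
def geoarrow_type_from_metadata(col: dict) -> str:
    """Derive a geoarrow type name from "encoding" plus "geometry_types"."""
    if col.get("encoding") != "geoarrow":
        raise ValueError("column is not geoarrow-encoded")

    geometry_types = col.get("geometry_types", [])
    # Assumption: a single geometry type is needed to pick one type name.
    if len(geometry_types) != 1:
        raise ValueError(f"expected exactly one geometry type, got {geometry_types!r}")

    # Assumed naming rule: "Point" -> "geoarrow.point"
    return "geoarrow." + geometry_types[0].lower()


print(geoarrow_type_from_metadata({"encoding": "geoarrow", "geometry_types": ["Point"]}))
```

An empty or mixed "geometry_types" list has no obvious answer in this scheme, which is part of why the two keys have to be combined with care.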


jorisvandenbossche commented on June 12, 2024

I don't like that last option because there are GeoArrow extension types for WKT and WKB. Even if they aren't necessarily allowed/encouraged for use in this spec, I don't think we can guarantee that there is one canonical extension name per combination of geometry types, and functionally the extension name is what is required for a reader implementation.

I do think that we should probably require "encoding": "WKB" for those cases, and disallow "encoding": "geoarrow.wkb", because otherwise that gives two ways to specify the same thing? And while this requires some name mapping in geoarrow-aware writers, it ensures that all existing readers will still work fine for files using WKB.
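The writer-side name mapping could be as small as the sketch below. The function name `encoding_for_extension_name` is hypothetical, and the pass-through branch assumes the first option from the earlier comment (the full geoarrow type name used as the "encoding" value), which is not decided. Only the WKB case is covered; what to do with other serialized types such as WKT is left open here.

```python
def encoding_for_extension_name(extension_name: str) -> str:
    """Map a GeoArrow extension name to the "encoding" value a writer emits."""
    # WKB keeps the existing spelling so that current readers keep working.
    if extension_name == "geoarrow.wkb":
        return "WKB"
    # Assumption: native types are written under their full type name.
    return extension_name


print(encoding_for_extension_name("geoarrow.wkb"))
print(encoding_for_extension_name("geoarrow.point"))
```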

(which I think also makes this option of using "encoding": "geoarrow" combined with geometry_types a possibility, although still not necessarily a preferred option)

The second consideration is which GeoArrow memory layouts to allow.

I think we should best list the options that are allowed. We can always expand that later if geoarrow grows more options.
(for the example you gave, is there a reason you only listed "geoarrow.multipolygon" and not "geoarrow.polygon"?)

For the interleaved vs separated layout: I think it is clear that the separated layout has the most benefit in combination with Parquet, because of the statistics you get for free (and maybe better compression / faster reading). But I am not fully sure we should only allow that layout. It's certainly possible to have a case where you don't care about this, and you just need the fastest possible option to store and re-read a bunch of data. And if your target system needs interleaved data (like shapely/geopandas), storing as interleaved might be the fastest option (although I should verify this in practice!)
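As a toy illustration of why the separated layout gives statistics "for free", the sketch below uses plain Python lists in place of real Arrow or Parquet buffers. The variable names and sample coordinates are made up for the example.

```python
# Three sample points as (x, y) pairs.
points = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]

# Interleaved layout: one flat buffer [x0, y0, x1, y1, ...].
interleaved = [value for point in points for value in point]

# Separated layout: one buffer per dimension.
separated = {
    "x": [point[0] for point in points],
    "y": [point[1] for point in points],
}

# Per-column min/max on the separated layout directly form a bounding box.
bbox = (
    min(separated["x"]),
    min(separated["y"]),
    max(separated["x"]),
    max(separated["y"]),
)

# Min/max over the single interleaved buffer mixes x and y values,
# so it does not describe the extent in either dimension.
mixed_range = (min(interleaved), max(interleaved))

print(bbox)
print(mixed_range)
```

The per-dimension min/max is the kind of information that column statistics would carry when x and y are stored as separate columns, while the interleaved buffer already matches what a consumer expecting x/y pairs would read back.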

For the actual specification update, we should probably detail, for each of the geoarrow types, which Parquet type it maps to.
