Giter Site home page Giter Site logo

Comments (15)

jorisvandenbossche avatar jorisvandenbossche commented on June 12, 2024 3

The format spec itself is still very clear about this being based on GeoArrow: https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#native-encodings-based-on-geoarrow

Although I see we only link to the GeoArrow document with the names, not the the document with the memory layouts (https://geoarrow.org/format.html#native-encoding). That's something we should improve.

from geoparquet.

jorisvandenbossche avatar jorisvandenbossche commented on June 12, 2024 3

have the encoding just be 'columnar'

FWIW I also want to clarify that I find using "columnar" in this context is ambiguous / confusing. When using "columnar" in the context of Parquet as a "columnar file format" or Arrow's "Columnar (in-memory) Format" specification, everything we discuss here is columnar. Also with the WKB encoding, all those WKB values in the geometry column are stored in a "columnar" fashion (all the binary blobs of the WKB values are stored together in one buffer).

from geoparquet.

jorisvandenbossche avatar jorisvandenbossche commented on June 12, 2024 1

There was quite some discussion about this on the PR #189 (see for example the thread above and below this comment: #189 (comment)). The PR initially started with an "encoding": "geoarrow", and the original issue (#185) also mentioned the option of combining this with the existing geometry_types key.

But in the end we moved away of this tight coupling with geoarrow for the naming in the spec here, although we are still using a subset of the geoarrow specification.
(from the top of my head, some reasons: this allows deviating in the future (eg adding a struct-based (sparse union) geometrycollection type that might not match exactly with something from GeoArrow), we are using a subset of the GeoArrow specification anyway (not all things allowed in GeoArrow are allowed here), and the name is also not necessarily relevant for all Parquet implementations (they cam have nothing to do with Arrow)

The format spec itself is still very clear about this being based on GeoArrow: https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#native-encodings-based-on-geoarrow (and yes, that means you might need to read the spec when encountering a file with such a column, but IMO that would still be the case anyway even if the encoding said "geoarrow").

It's true that the geometry_types key is not very useful in this specific case, but it should of course exactly match the type in the data, so something like encoding = point, geometry_types = [Polygon] is invalid (that already follows from the current spec saying this field should match with the data, but we could also be more explicit and mention in the native encodings section that geometry_type then is always a length-1 list)

from geoparquet.

kylebarron avatar kylebarron commented on June 12, 2024 1

It's true that the geometry_types key is not very useful in this specific case

It could be useful to know in some circumstances that the encoding is multi polygon but the column includes both polygons and multi polygons

from geoparquet.

jiayuasu avatar jiayuasu commented on June 12, 2024 1

In the future, GeoParquet should really support mixed geometry types in the same column. Sedona community will propose some solutions soon.

For example, in the lates release of Overture Maps data (https://docs.overturemaps.org/schema/), base/water has LineString type water (rivers) and Polygon type water (lakes). This is not possible in the new GeoParquet encoding.

from geoparquet.

m-mohr avatar m-mohr commented on June 12, 2024 1

That's the risk of implementing an unreleased spec, I'd say... It should still be possible to make a change.

from geoparquet.

jorisvandenbossche avatar jorisvandenbossche commented on June 12, 2024 1

Someone recently pointed out to me that there's potentially more efficient geometry encoding techniques, like 'zigzag encoding coordinate deltas in varints',

While we didn't add prefixes now, that doesn't mean we can't add suffixes later if we want to add other encodings, like "point_delta" (or at that point add a prefix like "delta.point")

but the latest names seem to imply that they're the 'right' way to do things, and then they also sorta 'hide' WKB, since there's a big list of arrow ones right next to the single 'wkb' name.

I wouldn't say the "right" way, but I think it is the simplest way. I personally think it is fine to use the implicit "default" namespace for that.
About hiding the single "wkb", that's seems is something we can solve with better rephrasing in the documentation (also renaming them with a prefix still requires to list all options)


While I am personally happy with what we have right now, I agree we should still allow ourselves to change things (although if we think we want to do that, I wouldn't wait too long with that, so GDAL/geopandas can be updated to follow this as fast as possible).
Both in geopandas and GDAL I think this is opt-in, and not used by default (well, for this topic; the also unreleased bbox covering column is written by default in GDAL)

(that's generally the trade-off we have with the desire to have some implementations to test things before calling the 1.1 spec final, and then actually thinking to change things before doing that ..)

from geoparquet.

paleolimbot avatar paleolimbot commented on June 12, 2024

I think that it is impossible not to mix some concerns with the single geometry-type encodings...the solution we settled on does have some overlap between the encoding name and the geometry type, but avoids mixing some other concerns and concepts to more accurately convey the relationships among the single-geometry layouts, Parquet and Arrow, for example.

It could be useful to know in some circumstances that the encoding is multi polygon but the column includes both polygons and multi polygons

I was under the impression that any features in a "multipolygon" encoding were all multipolygons (even if they only contained one element). This would be consistent with something like GDAL's reporting a layer's geometry type. We should probably make that explicit in the spec language if it is currently ambiguous.

It's true that the geometry_types key is not very useful in this specific case,

Technically it will tell you if you have Z coordinates or not (because the encoding key does not contain information about dimensions but the geometry_types key does). Leaving it empty would be a strange thing to do and I don't think it will be a problem for writers to get this part right; however, it is also not difficult (and likely safer, since it needs to check the memory layout anyway) for readers to fill in this piece of information themselves.

from geoparquet.

cholmes avatar cholmes commented on June 12, 2024

But in the end we moved away of this tight coupling with geoarrow for the naming in the spec here, although we are still using a subset of the geoarrow specification.
(from the top of my head, some reasons: this allows deviating in the future (eg adding a struct-based (sparse union) geometrycollection type that might not match exactly with something from GeoArrow), we are using a subset of the GeoArrow specification anyway (not all things allowed in GeoArrow are allowed here), and the name is also not necessarily relevant for all Parquet implementations (they cam have nothing to do with Arrow)

I think it may be too late, as we've already got some implementations released with this, but I was also wondering why we didn't put some 'prefix' on the encodings. Like if not geoarrow.point then at least like columnar.point or something (there's likely some better name). Or a new field for 'columnar_geometry_type' and have the encoding just be 'columnar' and 'wkb'.

Someone recently pointed out to me that there's potentially more efficient geometry encoding techniques, like 'zigzag encoding coordinate deltas in varints', and I had been thinking that geoparquet spec is 'open' to having a newer encoding, but the current arrow ones seem to take up the 'default namespace'. Like we could add some cool new encoding, and they'll likely have the same geometry types. I don't think it's a huge deal to have like new_encoding.point, but the latest names seem to imply that they're the 'right' way to do things, and then they also sorta 'hide' WKB, since there's a big list of arrow ones right next to the single 'wkb' name.

from geoparquet.

jwass avatar jwass commented on June 12, 2024

Someone recently pointed out to me that there's potentially more efficient geometry encoding techniques, like 'zigzag encoding coordinate deltas in varints

Somewhat tangential - I've been pretty eager to test out new encodings like the varint/delta/zigzag Chris mentioned (used in OSM PBF) and also Parquet's native bit-packed integer encodings as well. Although these would also require x,y columns to be integers not doubles - which I thought might be a non-starter

from geoparquet.

paleolimbot avatar paleolimbot commented on June 12, 2024

zigzag encoding coordinate deltas in varints

I am assuming that's from/inspired by twkb? https://github.com/TWKB/Specification/blob/master/twkb.md#zigzag-encode . It would be derailing this thread to talk about that, but I think that would undo the main benefit of this encoding which is that no Geo-specific parsing is required.

That's the risk of implementing an unreleased spec

This particular change was an open PR for months that had some pretty significant discussion around how to name the encodings. It was not easy to come up with a consensus...I don't think the result is perfect, but I also don't think it has to perfectly separate concerns (just be specific enough that implementations are able to work with this, which it seems that they are). If/when a better encoding comes along, the name can be updated.

from geoparquet.

jwass avatar jwass commented on June 12, 2024

I am assuming that's from/inspired by twkb? https://github.com/TWKB/Specification/blob/master/twkb.md#zigzag-encode . It would be derailing this thread to talk about that, but I think that would undo the main benefit of this encoding which is that no Geo-specific parsing is required.

I agree and don't want to derail the conversation here. However if we could enable x and y to be fixed precision decimal we could utilize parquet native integer encodings (e.g. delta binary packed) and not need geo-specific parsing to read the coordinates properly... but let's leave that for another discussion :)

from geoparquet.

cholmes avatar cholmes commented on June 12, 2024

This particular change was an open PR for months that had some pretty significant discussion around how to name the encodings. It was not easy to come up with a consensus...I don't think the result is perfect, but I also don't think it has to perfectly separate concerns (just be specific enough that implementations are able to work with this, which it seems that they are).

Yeah, I felt bad that I didn't read it closely enough / didn't think about there being new potential encodings.

If/when a better encoding comes along, the name can be updated.

Ah, that's a good point - I was sorta thinking we'd be 'stuck' with these names. But I guess the client logic would just have to be that at version 1.4 or whatever it would need to check the version number.

While we didn't add prefixes now, that doesn't mean we can't add suffixes later if we want to add other encodings, like "point_delta" (or at that point add a prefix like "delta.point")

Cool, I like this idea.

About hiding the single "wkb", that's seems is something we can solve with better rephrasing in the documentation (also renaming them with a prefix still requires to list all options)

Yeah, that sounds good. I'll look into tweaks that may make it clearer. It's good in the full description (one of the geometry types, or WKB), but in the single line it just looks like one of many options.

I personally think it is fine to use the implicit "default" namespace for that.

Sounds great. Thanks to everyone for sounding in, and I agree. I definitely didn't want to try to be pushing for a change at this late time. I mostly just wanted to be sure we had a path to other encodings, and weren't backing ourselves into a corner with declaring this one. But it sounds like we have lots of options. And it's good to have the discussion, to point at in the future if/when people want to propose other encodings.

from geoparquet.

cholmes avatar cholmes commented on June 12, 2024

have the encoding just be 'columnar'

FWIW I also want to clarify that I find using "columnar" in this context is ambiguous / confusing. When using "columnar" in the context of Parquet as a "columnar file format" or Arrow's "Columnar (in-memory) Format" specification, everything we discuss here is columnar.

Yeah, I was pretty sure that was a poor name - I just didn't want to spend a lot of time crafting an 'ideal' name when all I was trying to do was to make the point. If we had overwhelming enthusiasm to change (which to be clear I personally don't) then we could figure out the 'right' name.

from geoparquet.

cholmes avatar cholmes commented on June 12, 2024

Closing this issue, as it looks like it was well discussed. There were some interesting things in this discussion, like #207 (comment) but we should just have cleaner issues / PR's.

from geoparquet.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.