Comments (15)
The format spec itself is still very clear about this being based on GeoArrow: https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#native-encodings-based-on-geoarrow
Although I see we only link to the GeoArrow document with the names, not the the document with the memory layouts (https://geoarrow.org/format.html#native-encoding). That's something we should improve.
from geoparquet.
have the encoding just be 'columnar'
FWIW I also want to clarify that I find using "columnar" in this context is ambiguous / confusing. When using "columnar" in the context of Parquet as a "columnar file format" or Arrow's "Columnar (in-memory) Format" specification, everything we discuss here is columnar. Also with the WKB encoding, all those WKB values in the geometry column are stored in a "columnar" fashion (all the binary blobs of the WKB values are stored together in one buffer).
from geoparquet.
There was quite some discussion about this on the PR #189 (see for example the thread above and below this comment: #189 (comment)). The PR initially started with an "encoding": "geoarrow"
, and the original issue (#185) also mentioned the option of combining this with the existing geometry_types
key.
But in the end we moved away of this tight coupling with geoarrow for the naming in the spec here, although we are still using a subset of the geoarrow specification.
(from the top of my head, some reasons: this allows deviating in the future (eg adding a struct-based (sparse union) geometrycollection type that might not match exactly with something from GeoArrow), we are using a subset of the GeoArrow specification anyway (not all things allowed in GeoArrow are allowed here), and the name is also not necessarily relevant for all Parquet implementations (they cam have nothing to do with Arrow)
The format spec itself is still very clear about this being based on GeoArrow: https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#native-encodings-based-on-geoarrow (and yes, that means you might need to read the spec when encountering a file with such a column, but IMO that would still be the case anyway even if the encoding said "geoarrow").
It's true that the geometry_types
key is not very useful in this specific case, but it should of course exactly match the type in the data, so something like encoding = point, geometry_types = [Polygon]
is invalid (that already follows from the current spec saying this field should match with the data, but we could also be more explicit and mention in the native encodings section that geometry_type
then is always a length-1 list)
from geoparquet.
It's true that the
geometry_types
key is not very useful in this specific case
It could be useful to know in some circumstances that the encoding is multi polygon but the column includes both polygons and multi polygons
from geoparquet.
In the future, GeoParquet should really support mixed geometry types in the same column. Sedona community will propose some solutions soon.
For example, in the lates release of Overture Maps data (https://docs.overturemaps.org/schema/), base/water
has LineString
type water (rivers) and Polygon
type water (lakes). This is not possible in the new GeoParquet encoding.
from geoparquet.
That's the risk of implementing an unreleased spec, I'd say... It should still be possible to make a change.
from geoparquet.
Someone recently pointed out to me that there's potentially more efficient geometry encoding techniques, like 'zigzag encoding coordinate deltas in varints',
While we didn't add prefixes now, that doesn't mean we can't add suffixes later if we want to add other encodings, like "point_delta"
(or at that point add a prefix like "delta.point"
)
but the latest names seem to imply that they're the 'right' way to do things, and then they also sorta 'hide' WKB, since there's a big list of arrow ones right next to the single 'wkb' name.
I wouldn't say the "right" way, but I think it is the simplest way. I personally think it is fine to use the implicit "default" namespace for that.
About hiding the single "wkb", that's seems is something we can solve with better rephrasing in the documentation (also renaming them with a prefix still requires to list all options)
While I am personally happy with what we have right now, I agree we should still allow ourselves to change things (although if we think we want to do that, I wouldn't wait too long with that, so GDAL/geopandas can be updated to follow this as fast as possible).
Both in geopandas and GDAL I think this is opt-in, and not used by default (well, for this topic; the also unreleased bbox covering column is written by default in GDAL)
(that's generally the trade-off we have with the desire to have some implementations to test things before calling the 1.1 spec final, and then actually thinking to change things before doing that ..)
from geoparquet.
I think that it is impossible not to mix some concerns with the single geometry-type encodings...the solution we settled on does have some overlap between the encoding name and the geometry type, but avoids mixing some other concerns and concepts to more accurately convey the relationships among the single-geometry layouts, Parquet and Arrow, for example.
It could be useful to know in some circumstances that the encoding is multi polygon but the column includes both polygons and multi polygons
I was under the impression that any features in a "multipolygon"
encoding were all multipolygons (even if they only contained one element). This would be consistent with something like GDAL's reporting a layer's geometry type. We should probably make that explicit in the spec language if it is currently ambiguous.
It's true that the
geometry_types
key is not very useful in this specific case,
Technically it will tell you if you have Z
coordinates or not (because the encoding
key does not contain information about dimensions but the geometry_types
key does). Leaving it empty would be a strange thing to do and I don't think it will be a problem for writers to get this part right; however, it is also not difficult (and likely safer, since it needs to check the memory layout anyway) for readers to fill in this piece of information themselves.
from geoparquet.
But in the end we moved away of this tight coupling with geoarrow for the naming in the spec here, although we are still using a subset of the geoarrow specification.
(from the top of my head, some reasons: this allows deviating in the future (eg adding a struct-based (sparse union) geometrycollection type that might not match exactly with something from GeoArrow), we are using a subset of the GeoArrow specification anyway (not all things allowed in GeoArrow are allowed here), and the name is also not necessarily relevant for all Parquet implementations (they cam have nothing to do with Arrow)
I think it may be too late, as we've already got some implementations released with this, but I was also wondering why we didn't put some 'prefix' on the encodings. Like if not geoarrow.point
then at least like columnar.point
or something (there's likely some better name). Or a new field for 'columnar_geometry_type' and have the encoding just be 'columnar' and 'wkb'.
Someone recently pointed out to me that there's potentially more efficient geometry encoding techniques, like 'zigzag encoding coordinate deltas in varints', and I had been thinking that geoparquet spec is 'open' to having a newer encoding, but the current arrow ones seem to take up the 'default namespace'. Like we could add some cool new encoding, and they'll likely have the same geometry types. I don't think it's a huge deal to have like new_encoding.point
, but the latest names seem to imply that they're the 'right' way to do things, and then they also sorta 'hide' WKB, since there's a big list of arrow ones right next to the single 'wkb' name.
from geoparquet.
Someone recently pointed out to me that there's potentially more efficient geometry encoding techniques, like 'zigzag encoding coordinate deltas in varints
Somewhat tangential - I've been pretty eager to test out new encodings like the varint/delta/zigzag Chris mentioned (used in OSM PBF) and also Parquet's native bit-packed integer encodings as well. Although these would also require x,y columns to be integers not doubles - which I thought might be a non-starter
from geoparquet.
zigzag encoding coordinate deltas in varints
I am assuming that's from/inspired by twkb? https://github.com/TWKB/Specification/blob/master/twkb.md#zigzag-encode . It would be derailing this thread to talk about that, but I think that would undo the main benefit of this encoding which is that no Geo-specific parsing is required.
That's the risk of implementing an unreleased spec
This particular change was an open PR for months that had some pretty significant discussion around how to name the encodings. It was not easy to come up with a consensus...I don't think the result is perfect, but I also don't think it has to perfectly separate concerns (just be specific enough that implementations are able to work with this, which it seems that they are). If/when a better encoding comes along, the name can be updated.
from geoparquet.
I am assuming that's from/inspired by twkb? https://github.com/TWKB/Specification/blob/master/twkb.md#zigzag-encode . It would be derailing this thread to talk about that, but I think that would undo the main benefit of this encoding which is that no Geo-specific parsing is required.
I agree and don't want to derail the conversation here. However if we could enable x and y to be fixed precision decimal we could utilize parquet native integer encodings (e.g. delta binary packed) and not need geo-specific parsing to read the coordinates properly... but let's leave that for another discussion :)
from geoparquet.
This particular change was an open PR for months that had some pretty significant discussion around how to name the encodings. It was not easy to come up with a consensus...I don't think the result is perfect, but I also don't think it has to perfectly separate concerns (just be specific enough that implementations are able to work with this, which it seems that they are).
Yeah, I felt bad that I didn't read it closely enough / didn't think about there being new potential encodings.
If/when a better encoding comes along, the name can be updated.
Ah, that's a good point - I was sorta thinking we'd be 'stuck' with these names. But I guess the client logic would just have to be that at version 1.4 or whatever it would need to check the version number.
While we didn't add prefixes now, that doesn't mean we can't add suffixes later if we want to add other encodings, like "point_delta" (or at that point add a prefix like "delta.point")
Cool, I like this idea.
About hiding the single "wkb", that's seems is something we can solve with better rephrasing in the documentation (also renaming them with a prefix still requires to list all options)
Yeah, that sounds good. I'll look into tweaks that may make it clearer. It's good in the full description (one of the geometry types, or WKB), but in the single line it just looks like one of many options.
I personally think it is fine to use the implicit "default" namespace for that.
Sounds great. Thanks to everyone for sounding in, and I agree. I definitely didn't want to try to be pushing for a change at this late time. I mostly just wanted to be sure we had a path to other encodings, and weren't backing ourselves into a corner with declaring this one. But it sounds like we have lots of options. And it's good to have the discussion, to point at in the future if/when people want to propose other encodings.
from geoparquet.
have the encoding just be 'columnar'
FWIW I also want to clarify that I find using "columnar" in this context is ambiguous / confusing. When using "columnar" in the context of Parquet as a "columnar file format" or Arrow's "Columnar (in-memory) Format" specification, everything we discuss here is columnar.
Yeah, I was pretty sure that was a poor name - I just didn't want to spend a lot of time crafting an 'ideal' name when all I was trying to do was to make the point. If we had overwhelming enthusiasm to change (which to be clear I personally don't) then we could figure out the 'right' name.
from geoparquet.
Closing this issue, as it looks like it was well discussed. There were some interesting things in this discussion, like #207 (comment) but we should just have cleaner issues / PR's.
from geoparquet.
Related Issues (20)
- Simplify or remove script dependencies HOT 3
- PROJJSON schema version HOT 4
- Metadata encoding options for GeoArrow-encoded columns in GeoParquet metadata HOT 2
- Is it possible to define a transform alongside a CRS, similar to geotiff? HOT 3
- Recommendation on the Arrow specific type for the WKB geometry column ? HOT 5
- Antimeridian Crossings and bbox HOT 9
- Update example files for 1.1 HOT 4
- The releases on the repository can be misleading regarding the status of GeoParquet as an OGC Standard HOT 1
- Clarify projection of bounding box columns HOT 4
- Covering Schema
- Clarify recommended file extension HOT 9
- List of Submitting Organisations HOT 3
- Enforce pull requests and approvals for all repository updates HOT 4
- Require status checks to pass before merging HOT 4
- Synchronise requirements in the metanorma asciidoc files with those in the gpq validator HOT 1
- add support wkt or wkt2 formats for crs HOT 26
- Thoughts about a first-class GEOMETRY data type in Parquet? HOT 20
- Start a 'best practices' document
- Forward compatibility guarantees? HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from geoparquet.