opengeospatial / geoparquet

Specification for storing geospatial vector data (point, line, polygon) in Parquet

Home Page: https://geoparquet.org

License: Apache License 2.0

Python 100.00%
gis cloud-native geospatial apache-parquet geoparquet


geoparquet's Issues

Should we recommend EPSG:4326 or something else?

This was discussed extensively in #25, but it feels worth revisiting. I think everyone agrees that the core interoperability recommendation is to use longitude, latitude order in the WKB. The main question is how we 'describe' that: use 4326 but then rely on the 'override' in our spec to put longitude first, or use something like OGC:CRS84, which is less popular but actually describes the axis order correctly.

Other points that were originally in #35:

  • Do we want to continue recommending EPSG:4326?
    • The GDAL page on this topic says: "The generic EPSG:4326 WGS 84 CRS is also considered dynamic, although it is not recommended to use it due to being based on a datum ensemble whose positional accuracy is 2 meters, but prefer one of its realizations, such as WGS 84 (G1762)"
    • For example, QGIS now warns about this, see https://twitter.com/nyalldawson/status/1390118738251317254 for some context.

Store information about planar vs spherical coordinates (eg geodesic=true/false)

This has come up in several places (e.g. most recently in #3 (comment)), and was brought up in a previous meeting.

Geospatial analytical systems can interpret / treat geometries' coordinates as planar or spherical. For example, GEOS treats everything as planar coordinates (and thus so do GeoPandas, R's sf, and PostGIS when using GEOS). Other libraries can handle spherical coordinates, such as R's s2 package (now used by default in sf for geographic coordinates) or BigQuery's Geography functionality (I think both are based on Google's s2geometry). PostGIS also differentiates between a geometry and a geography type.

Once you deal with spherical coordinates, you also have to deal with the edges of geometries. A geometry can be valid (i.e. no intersecting edges) when interpreting the coordinates as planar, but could be invalid when interpreting the geometries as spherical.
And that's where the IO aspect comes into the picture. When reading in data (and working with spherical coordinates), you can either 1) assume the edges are already valid as spherical coordinates, or 2) convert planar edges to spherical edges.

For example, BigQuery assumes spherical edges when reading in from WKT (with the planar=TRUE option in ST_GEOGFROMTEXT to override this default), but planar edges when parsing GeoJSON (see https://cloud.google.com/bigquery/docs/geospatial-data#coordinate_systems_and_edges).

Quoting @paleolimbot:

There is currently no way to communicate in any file "this was exported from BigQuery Geography or S2 so you can import it there again without tessellating all the edges again" (e.g., use planar = true when importing to BigQuery).

So it would be useful to store this information about the edges in the file metadata, instead of requiring the user of the data to know this and specify it as an option when reading the data.

The concrete proposal would be to have an additional column metadata field to indicate this. I think a boolean flag is fine for this, and possible names are "geodesic": true/false or "planar": true/false.
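
A minimal sketch of what that could look like in the column metadata, assuming the "geodesic" spelling from the proposal (the field name and exact semantics are what this issue should decide):

column_metadata = {
    "geometry": {
        "encoding": "WKB",
        "crs": "...",        # WKT2 string, as elsewhere in the spec
        "geodesic": True,    # proposed: edges are spherical (e.g. data exported from BigQuery / S2)
    }
}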


Note: I am no expert on this front (GeoPandas is, for now, still only using GEOS and thus planar coordinates, so I don't have much experience with handling spherical coordinates). So please correct me if anything in the above isn't fully correct :)

nz-buildings-outlines.parquet sample file uses 'schema_version' instead of 'version'

https://storage.googleapis.com/open-geodata/linz-examples/nz-buildings-outlines.parquet has the following 'geo' metadata value:

{
  "primary_column": "geometry",
  "columns": {
    "geometry": {
      "crs": "PROJCRS[\"NZGD2000 / New Zealand Transverse Mercator 2000\",BASEGEOGCRS[\"NZGD2000\",DATUM[\"New Zealand Geodetic Datum 2000\",ELLIPSOID[\"GRS 1980\",6378137,298.257222101,LENGTHUNIT[\"metre\",1]]],PRIMEM[\"Greenwich\",0,ANGLEUNIT[\"degree\",0.0174532925199433]],ID[\"EPSG\",4167]],CONVERSION[\"New Zealand Transverse Mercator 2000\",METHOD[\"Transverse Mercator\",ID[\"EPSG\",9807]],PARAMETER[\"Latitude of natural origin\",0,ANGLEUNIT[\"degree\",0.0174532925199433],ID[\"EPSG\",8801]],PARAMETER[\"Longitude of natural origin\",173,ANGLEUNIT[\"degree\",0.0174532925199433],ID[\"EPSG\",8802]],PARAMETER[\"Scale factor at natural origin\",0.9996,SCALEUNIT[\"unity\",1],ID[\"EPSG\",8805]],PARAMETER[\"False easting\",1600000,LENGTHUNIT[\"metre\",1],ID[\"EPSG\",8806]],PARAMETER[\"False northing\",10000000,LENGTHUNIT[\"metre\",1],ID[\"EPSG\",8807]]],CS[Cartesian,2],AXIS[\"northing (N)\",north,ORDER[1],LENGTHUNIT[\"metre\",1]],AXIS[\"easting (E)\",east,ORDER[2],LENGTHUNIT[\"metre\",1]],USAGE[SCOPE[\"Engineering survey, topographic mapping.\"],AREA[\"New Zealand - North Island, South Island, Stewart Island - onshore.\"],BBOX[-47.33,166.37,-34.1,178.63]],ID[\"EPSG\",2193]]",
      "encoding": "WKB",
      "bbox": [
        1167512.311218509,
        4794679.949937864,
        2089113.650566361,
        6190596.90070761
      ]
    }
  },
  "schema_version": "0.1.0",
  "creator": {
    "library": "geopandas",
    "version": "0.10.2"
  }
}

schema_version should be renamed to version
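
For reference, a hedged sketch of how such a file could be rewritten with pyarrow (the file names are illustrative; this is not part of the spec):

import json
import pyarrow.parquet as pq

table = pq.read_table("nz-buildings-outlines.parquet")
metadata = dict(table.schema.metadata or {})
geo = json.loads(metadata[b"geo"])
geo["version"] = geo.pop("schema_version")  # rename the offending key
metadata[b"geo"] = json.dumps(geo).encode("utf-8")
pq.write_table(table.replace_schema_metadata(metadata), "nz-buildings-outlines-fixed.parquet")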

Potential JSON Schema issues

Some potential improvements and/or issues in the JSON Schema:

  • $defs keyword in the schema was only added in draft 2019-09 (see also #127 )
  • additionalProperties: true is the default and could be removed
  • "patternProperties": with a pattern ".*" is basically equivalent to additionalProperties and thus can be simplified a bit.
  • "enum": ["WKB"] is equivalent to "const": "WKB", the same applies to "enum": ["counterclockwise"]
  • PROJJSON has a new schema version, 0.5
  • The bbox schema looks weird. Are more than 4 values allowed? Right now it would allow [0,1,2,3,"asdsad"]. It could probably be simplified using min/maxItems; I think you were looking for something like "items": {"type": "number"}, "minItems": 4 (see the sketch below).
  • By the way, the spec says double for epoch, but this is not enforced in the schema and could be an int (or should this be "number" in the spec?)
  • My reading of the spec is that you can either specify an array of geometry types, or a single geometry type as a string, or Unknown (always as a string, never inside an array). If that is the intent, the schema doesn't check it: ["Polygon", "Unknown"] is currently valid.
  • The Z suffix for the geometry types does not seem to be covered by the JSON Schema.

Happy to work on PRs, just let me know whether all this is correctly captured.
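
To make the bbox point concrete, a sketch of a tightened schema fragment checked with the jsonschema package (the maxItems value is an assumption to leave room for a Z range; the exact keywords in the spec's schema may differ):

import jsonschema

bbox_schema = {
    "type": "array",
    "items": {"type": "number"},
    "minItems": 4,
    "maxItems": 6,  # assumption: allow an optional Z range
}

jsonschema.validate([0, 1, 2, 3], bbox_schema)  # passes
try:
    jsonschema.validate([0, 1, 2, 3, "asdsad"], bbox_schema)
except jsonschema.ValidationError:
    print("rejected, as intended")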

Update JSON Schema version?

Just a random thing I noticed while looking through the repo: not sure whether this comes from the fact that all STAC JSON Schemas are still draft 7, but you may want to consider updating to a more recent version. That should make it a bit more future-proof, as you never know when the old JSON Schema versions will be dropped from libraries. Draft 7 is from March 2018 and has been superseded by draft 2019-09, which was itself superseded by draft 2020-12.

You are actually already partially into 2019-09 as for example the $defs keyword in the schema was only added in this version.

Encoding of the key-value metadata when stored in the Parquet FileMetadata

Currently, this is not exactly described as such in the spec (#6 is clarifying this), but in practice we are storing the geospatial metadata as a JSON-encoded string (json.dumps(..) in python terms, see the example file and implementation at https://github.com/opengeospatial/cdw-geo/tree/main/examples/geoparquet).

This means that the actual value we store in the Parquet FileMetaData's key_value_metadata under the "geo" key is a string value like '{"version": "0.1.0", "primary_column": "geometry", "columns": {"geometry": {"crs": ... }}}'

I am opening this issue to confirm explicitly that we are fine with this, or whether we want to consider a different formatting while further refining the spec.
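
For concreteness, a sketch of that current practice with pyarrow (column and path names are illustrative): the metadata dict is serialized with json.dumps and stored as the value of the "geo" key.

import json
import pyarrow as pa
import pyarrow.parquet as pq

geo_metadata = {
    "version": "0.1.0",
    "primary_column": "geometry",
    "columns": {"geometry": {"crs": "...", "encoding": "WKB"}},
}

table = pa.table({"geometry": pa.array([], type=pa.binary())})
# The "geo" value is a JSON-encoded string, as described above.
table = table.replace_schema_metadata({b"geo": json.dumps(geo_metadata).encode("utf-8")})
pq.write_table(table, "example.parquet")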

Validator improvements

For 1.0.0 we should have a validator that:

  1. Tests not just the metadata but looks at the data itself to make sure it matches the metadata
  2. Is user-friendly, not requiring Python. Ideally a web page and/or an easily installable binary.

This could be building on the current validator in this repo, or could be a new project we reference, but we want to be sure something exists, so putting this issue in to track it.

What are the Allowed Geometry Types?

The documentation just states that the geometry_type can be a list or a single string. Does this mean we can pass the geometry_type as [Point, MultiPoint], or does it have to be a string like "Point,MultiPoint" for multiple geometries?

Also, the allowed geometry type enums are not listed in the md file.

Fix the validator so it can install more widely.

Moving the schema to format-specs broke the validator: pip install . does not work with a symlink resource. It works with pip install -e ., but this is something we need to tackle, especially if we want to distribute the validator as a python package.

Originally posted by @Jesus89 in #87 (review)

crs data type: JSON object

The CRS data type says "JSON object", which could be understood as an object serialized as JSON (i.e. a string), so this should probably just say "object"?

Define polygon orientation rules

I think the standard should define polygon orientation.

1. Spherical edges case

With spherical edges on the sphere, there is an ambiguity in the polygon definition if the system allows polygons larger than a hemisphere.

A sequence of vertices that defines a polygon boundary can describe either the polygon to the left of that line or the one to the right of it. E.g. the global coastline can define either the continents or the oceans. Systems that support polygons larger than a hemisphere usually use an orientation rule to resolve this ambiguity; e.g. MS SQL Server and Google BigQuery interpret the side to the left of the line as the interior of the ring.

2. Planar edges case

The planar case does not have this ambiguity, but it is still a good idea to have a specific rule.

E.g. the GeoJSON RFC defines a rule consistent with the one above:

   o  Polygon rings MUST follow the right-hand rule for orientation
      (counterclockwise external rings, clockwise internal rings).

Spherical - orientation required, smaller-of-two-possible, or just recommended?

It looks like there are a couple of options for the case where edges is spherical:

  • If edges is spherical then counterclockwise orientation is required. (Or say that if it is not set then the default is counterclockwise instead of null - effectively the same, but maybe slightly better?)
  • If edges is spherical and orientation is left blank, then have implementations use the 'smaller-of-two-possible' rule, as used by BigQuery and SQL Server.

We could also just 'recommend' its use, and not mention the smaller-of-two-possible rule. Though that seems far from ideal to me, as it doesn't tell implementations what to do if they get spherical data without it set.

Currently in main it does say to use the smaller-of-two-possible rule, but it is likely poorly described, as I wrote it and was just trying to capture something I don't 100% understand.

In #80 @jorisvandenbossche removed the smaller-of-two-possible rule, which I think is totally fine. But I'd like us to make an explicit decision about it.

Originally posted by @mentin in #46 (comment)

Store optional bounding box information in the column metadata

There was a bit of discussion around this in #4.

The proposal is to add an optional column metadata field (alongside the currently required "crs" and "encoding" fields) that describes the bounding box of the full file (so the overall bounding box or envelope of all geometries in the file).

In the geo-arrow-spec version of this metadata specification, we are already using it (https://github.com/geopandas/geo-arrow-spec/blob/main/metadata.md#bounding-boxes), and there it takes the form of a list that specifies the minimum and maximum values of each dimension. So for 2D data it would look like "bbox": [<xmin>, <ymin>, <xmax>, <ymax>].

This formatting aligns with for example the GeoJSON spec (https://datatracker.ietf.org/doc/html/rfc7946#section-5).


This optional information can be useful when processing this data. For example, in dask-geopandas we already make use of this feature to filter partitions (sub-datasets) of a dataset. When using Parquet, people often make use of "partitioned datasets", where the dataset consists of (potentially nested directories of) many smaller Parquet files. In such a situation, you can spatially sort the data when dividing it into partitions, so that each individual file contains the data of a certain region. If each individual Parquet file then stores the bounding box of its geometries, a reader can open only the files needed for a spatial query (a kind of "predicate pushdown", similar to what Parquet already enables based on column statistics).
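
As a rough illustration of that predicate pushdown idea (file layout and query window are made up for the example; none of this is prescribed by the spec), a reader could inspect each file's "geo" metadata and skip files whose bbox does not intersect the query window:

import glob
import json
import pyarrow.parquet as pq

def intersects(a, b):
    # a and b are [xmin, ymin, xmax, ymax]
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

query = [5.0, 50.0, 6.0, 51.0]  # illustrative query window
for path in glob.glob("dataset/*.parquet"):
    geo = json.loads(pq.read_schema(path).metadata[b"geo"])
    bbox = geo["columns"][geo["primary_column"]].get("bbox")
    if bbox is None or intersects(bbox, query):
        table = pq.read_table(path)  # only files that may contain matching geometries are read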

Proposal: suggested practices around required / optional metadata fields and spec extensions

My goal here is to try and reframe some of the challenges I've seen us struggling with around specific metadata fields and data standardization, and outline some suggestions about how to approach these issues in order to reduce version-to-version churn and increase implementer buy-in. This is intended to be more at a "meta" level; discussion of specific fields should happen in specific GH issues.

In my own opinion, the primary goal of the specification is to support interoperability through standard documentation of what is confidently known about a dataset, so that writers can document what they know in a standardized fashion, and readers can trust and operate correctly on the dataset based on what is documented. This can be achieved while still supporting the highly variable underlying data representations (winding order, CRS, etc.) that can be contained within the existing encoding (WKB), supporting both high-performance, low-transformation internal use - which was the primary goal of the original spec from which this emerged - and some degree of portability within the broader spatial ecosystem. Unfortunately, this still places some burden on readers to deal with some of the messier issues of lack of standardization in the underlying data.

There are secondary goals around standardizing data representation in order to support greater interoperability, but these typically involve more serious tradeoffs in performance, transformation, and potentially loss of information compared to the original untransformed data. Making this the primary goal through combinations of metadata fields and / or other specification requirements undermines the original benefits of the format and risks forcing implementations to bifurcate their handling of data encoded into this container: those that optimize internal use and avoid standardization, and those that optimize portability.

My hope is that some of the suggestions below help get at both of those goals in a complementary manner.

Required metadata fields:

These should be rare and new fields should be treated with an abundance of caution. These define the information that must be known in order to safely read any GeoParquet file. These should not assert any nice-to-have standardization of underlying data.

There should be a more gradual process for introducing these, which should include sufficient time for current and future implementers to raise concerns about impacts to performance, ability to safely transform existing input data to meet these requirements, etc. Sometimes we have to raise these with our respective communities in order to better identify issues, and that takes time.

It may be appropriate for required fields to first start out as optional fields while getting buy-in from the ecosystem. Once there is good consensus that a new field absolutely must be present or we'll suffer major errors on deserialization, that is a reasonable time to promote them to required fields.

Optional metadata fields:

These are intended to document properties that are confidently known about a dataset, in order to support readers of that dataset so that they can better trust the dataset as well as opt out of standard pre-processing (e.g., if you know the winding order on input, you don't have to check the underlying geometries and fix it).

Except in very rare cases, the default value of an optional metadata field should be that the specification makes no assertion about the underlying data (aka absent = null); readers are on their own. It can and should encourage writers to provide this information.

What we've seen is that there is a lot of divergence between defaults that seem sensible in theory, and those that are reasonable in practice, and it leads to awkward and avoidable issues within implementations.

When optional fields specify a non-null default, this is a trap. It is logically equivalent to a requirement that says either you state that field=X, or if unstated it MUST be Y. That is effectively a required element at that point, because implementers must now take a value that may be genuinely unknown and not safely knowable automatically and either coerce it into X or Y (aka unsafe / untrustworthy documentation), or prevent writing the data (which is bad for internal use). Thus the default value of optional fields should not be used to make recommendations about underlying data representation.

Instead, optional fields should encourage documenting even the common use cases. E.g., if you know that the encoding is UTF-8 (arbitrary example), then there is no harm in stating it; leaving it unset means that you didn't know it confidently enough to set it, and that setting it when you are not confident is risky.

Provided we default to absent = null, it seems reasonable for optional fields to make recommendations to writers about how to better standardize their data and then document it. E.g., we encourage you to use counterclockwise winding order and document it with orientation="counterclockwise". The emphasis is on documenting according to the spec the data standardization that you've opted in to.

Specification extensions:

There appears to be a real desire to simplify some of the otherwise messy issues of geo data through data standardization. This should be opt-in, because it has real implications for performance, data loss, etc.

Let's define a specification "extension" as a mechanism that leverages all of the existing higher-level required / optional metadata fields AND prescribes specific data representation. It must be set up so that any reader can safely use the data using only the higher-level fields. However, it also signals to readers that if a dataset is of an extension type, some of those fields can be safely ignored and thus avoid some of the complexities of parsing things like CRS WKT. A writer of the extension type must still set those higher-level fields.

I think this gets at some of the ideas originally proposed in some of the default values for optional metadata fields as well as general recommendations within the spec.

I don't use cloud-optimized GeoTIFF yet, but my sense is that what I'm calling an extension type is similar to a COG vs regular GeoTIFF.

For example, let's define an extension type A (because names are hard and distracting) with the following requirements for data representation:

  • data must be in counterclockwise winding order
  • data must be in OGC:CRS84
  • data must have a single geometry type
  • data records must be sorted using the Hilbert curve of the centroid points of their bounding boxes (made this one up just for this example).

In this example, the writer would still set:

  • orientation="counterclockwise"
  • crs=<WKT of OGC:CRS84>
  • geometry_type=<type>
  • whatever optional field is defined re: sort order, if ever

The extension is a bit different from setting those fields on a case-by-case basis, because it can include data standardization not currently expressed via metadata fields, as well as bundling together related metadata fields. Otherwise, checking them individually within a reader gets to be a bit more complex.

A reader specifically built to consume pre-standardized data that wants to avoid the complexities of mixed CRS, mixed geometry types, etc can specifically look to see if a dataset has extension A set. If so, they can safely opt-out of any extra work to standardize data. If not, they can reject the data outright, or do more involved processing of the higher-level fields.

A writer can allow the user to opt-in to setting this extension. So for example:

  • dataset.to_parquet(filename) => does no extra data standardization or transformation, sets higher-level fields according to the spec
  • dataset.to_parquet(filename, extension="A") => reprojects the data to OGC:CRS84, reorients winding order as needed, sorts the records. Because user intentionally opted-in to this behavior, they are willing to accept the performance impacts and potential loss of data. Because this is optional, the writer can also validate and reject attempts to write data that cannot be coerced to meet the extension; then it is up to the user to standardize / subset / etc their data as needed before attempting to write.

Thus the core idea is that a specification extension allows us to keep a lot of flexibility within the default specification, while still having a path forward that streamlines reading data that can be pre-processed to conform to certain characteristics.

Support 3D coordinates

For this we'll need to decide on EWKB or ISO WKB.

From what everyone could tell the core WKB is the same in both, so we're starting with that. But we should have experts weigh in on what we should use.

Initial research shows ISO WKB is more of a 'standard', as EWKB has not been formalized in a document (though it has extensive support). With our CRS field we don't actually need the SRID part of EWKB.

Validation of data, not just metadata?

Should we do anything to try to validate the actual data in a geoparquet file?

The metadata is a relatively small amount of data, which makes it quick to validate even against remote files, while the Parquet file could hold many GBs of data.

Originally posted by @kylebarron in #64 (comment)

Branching strategy for spec development

Now that we have a release we should figure out a branching strategy going forward. I think the primary thing to consider is what version of the spec a person visiting https://github.com/opengeospatial/geoparquet should see: the currently released version, or the in-development version?

My thoughts: we keep things as is for now (just a single main branch that we develop against), at least until we hit 1.0. I think there's a bit more value in people seeing what's coming soon.

After 1.0, I could see making a dev branch. But IMO it's not worth branching until we have a change that's going to require a 1.x (or 2.0), otherwise we'll have to backport / cherry pick typo fixes and clarifications.

How to deal with dynamic CRS or CRS with ensemble datums (such as EPSG:4326)?

From #25 (comment). The spec has a required crs field that stores a WKT2:2019 string representation of the Coordinate Reference System.

We currently recommend using EPSG:4326 for the widest interoperability of the written files. However, this is a dynamic CRS, and in addition uses an ensemble datum. See https://gdal.org/user/coordinate_epoch.html for some context. In summary, when using coordinates with a dynamic CRS, you also need to know the point in time of the observation to know the exact location.

Some discussion topics related to this:

  • How do we deal with a dynamic CRS? We should probably give the option to include the "coordinate epoch" in the metadata (the point in time at which the coordinate is valid)
    • This coordinate epoch is not part of the actual CRS definition, so I think the most straightforward option is to specify an additional (optional) "epoch" field in the column metadata (next to "crs") that holds the epoch value as a decimal year (e.g. 2021.3) - see the sketch below.
    • This means we would only support a constant epoch per file. This is in line with the initial support for other formats in GDAL, and we can always later think about expanding this (e.g. an additional field in the Parquet file that has an epoch per geometry, or per vertex).
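
A sketch of what the proposed optional field could look like in the column metadata (the "epoch" name and its placement next to "crs" are this issue's proposal, not settled spec):

column_metadata = {
    "geometry": {
        "encoding": "WKB",
        "crs": "...",      # WKT2:2019 string, e.g. for EPSG:4326
        "epoch": 2021.3,   # proposed: decimal year at which the coordinates are valid
    }
}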

Add PR template to require PR authors to include necessary JSON Schema updates

With #64 we are now saying that the JSON schema file is a core part of the spec, so we should ensure that any PR that changes behavior always updates both the spec and the schema file. I think the easiest way to do this is just a checkbox nudge on any PR, to help us remember, though I'm open to other ideas.

I suppose that in turn will force us to update examples, but that seems like a good thing.

Example data to test implementations

One idea that @jorisvandenbossche suggested is that we should have a set of data that shows the range of the specification, that implementors can use to make sure they're handling it right, and which could be the basis of 'integration' testing.

This would include geoparquet files that have a variety of projections, geometry types (including multiple geometry types as in #119), plus things like multiple geometry columns, different edges values, etc. It could also be good to have a set of 'bad' files that can also be tested against.

Store JSON copy of metadata next to example Parquet files

Making a separate issue to follow up on #60 (comment)

I propose that for our examples we include a JSON file next to each Parquet file that mirrors its metadata. For example:

example1.parquet
# plain text schema that matches what's stored in example1.parquet
example1_schema.json

example2.parquet
# plain text schema that matches what's stored in example2.parquet
example2_schema.json

Benefits:

  • Should make it easier for people browsing the spec to see schema examples without needing to open up a file.
  • Git diffs should be easier to read and understand. Note that this requires that JSON files are always formatted in the same way, which we could enforce via a CI lint step.

We should also have a CI step that verifies the plain text schema in example1_schema.json matches what's actually stored in example1.parquet, and ideally that the schema passes validation (see #64 and #58).
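
A minimal sketch of that CI check, assuming pyarrow and the naming convention above (example1.parquet / example1_schema.json):

import json
import pyarrow.parquet as pq

def check_sidecar(parquet_path, json_path):
    # Compare the embedded "geo" metadata against the plain-text sidecar.
    embedded = json.loads(pq.read_schema(parquet_path).metadata[b"geo"])
    with open(json_path) as f:
        sidecar = json.load(f)
    assert embedded == sidecar, f"{json_path} is out of sync with {parquet_path}"

check_sidecar("example1.parquet", "example1_schema.json")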

Add 'goals'

We should be clear about what is in scope for 1.0.0, and then what is in and out of scope in general. And if we think some things should be in extensions or in the core spec.

Versioning and changes

Hello devs, maintainers and users,
Good day.

This isn't a tech issue per se; however, I see a couple of points here from my experience and discussions, and also given that readme.md forms part of the repo:

  1. Since backward compatibility isn't promised under 0.x: from a naïve user's point of view, would breaking changes happen at the programming-interface level or at the storage-file level?
  2. Given that GDAL (3.5) already supports the (Geo)Parquet format, will files generated with 0.4 remain readable in the future?
    (ref for Parquet: https://github.com/apache/parquet-format)

Hope the answers make it to the landing page.

Thanks,

Feature identifiers

Has there been discussion around including an id_column or something similar in the file metadata? I think it would assist in round-tripping features from other formats if it were known which column represented the feature identifier.

It looks like GDAL has a FID layer creation option. But I'm assuming that the information about which column was used when writing would be lost when reading from the parquet file (@rouault would need to confirm).

I grant that this doesn't feel "geo" specific, and there may be existing conventions in Parquet that would be appropriate.

Clarify usage with nested and repeated columns

The Parquet format supports nested and repeated fields. I assume the geometry columns are not limited to the top-level columns, and can be both nested and repeated.

1. Names

The format spec talks about column names; however, with a nested structure a name might not uniquely identify a column.

I suggest using the column path (like "a.b.c") in the docs to avoid the ambiguity. It would coincide with the column name in the typical case of a top-level geometry.

2. Primary column

Can the primary column be a nested column, or a repeated column, i.e. contain a list of geography values?

There is nothing that prevents this in the standard, but I guess the primary column was designed to be mapped to the built-in geometry column in formats like GeoJSON or shapefiles, and these assume non-repeated top-level columns. We can either

  • allow this, and let the tools decide what to do with this, or
  • restrict primary geometry column to be top-level and make primary_column metadata optional so it can be omitted when a GeoParquet file has no top-level geometry.

Advertising geometry field "type" in column metadata?

A common use case is for a geometry column to hold a single geometry type (Point, LineString, Polygon, ...) for all its records. It could be good to have an optional "type" field under https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#column-metadata to capture that when it is known.

This would help, for example, conversion between GeoParquet and GIS formats (shapefiles, GeoPackage) that typically have this information in their metadata.

Values for type could be the ones accepted by GeoJSON: Point, LineString, Polygon, MultiPoint, MultiLineString, MultiPolygon, GeometryCollection (that would be extended to CircularString, CompoundCurve, CurvePolygon, MultiCurve, MultiSurface, PolyhedralSurface, TIN if we support ISO WKB, and with Z, M or ZM suffixes for other dimensionalities)

What to do when there are mixed geometry types?

  • do not set "type"
  • set "type": "mixed"
  • set "type": array of values, e.g. [ "Polygon", "MultiPolygon" ] (this one would be typical when converting from shapefiles, where the polygonal type can hold both polygons and multipolygons)
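
A sketch of what the proposed optional field could look like, including the array form for the mixed shapefile case (the field name "type" and the allowed values are the proposal above, not settled spec):

column_metadata = {
    "geometry": {
        "encoding": "WKB",
        "crs": "...",
        "type": ["Polygon", "MultiPolygon"],  # proposed; a single string such as "Point" for the common case
    }
}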

File level metadata field names

At this point we propose to store metadata at the very top level:
i.e.

metadata = {
    "version": "0.1.0",
    "primary_column": "geometry",
    "columns": {
        "geometry": {
            "crs": df.crs.to_wkt(pyproj.enums.WktVersion.WKT2_2019_SIMPLIFIED),
            "encoding": "WKB",
        }
    }
}

What if the metadata field names overlap with those used by other libraries?

As a solution to this issue we can prefix all geoparquet fields, i.e.

metadata = {
    "geoparquet.version": "0.1.0",
    "geoparquet.primary_column": "geometry",
    "geoparquet.columns": {
        "geometry": {
            "crs": df.crs.to_wkt(pyproj.enums.WktVersion.WKT2_2019_SIMPLIFIED),
            "encoding": "WKB",
        }
    }
}

Or make the metadata nested:

metadata = {
    "geoparquet": {
        "version": "0.1.0",
        "primary_column": "geometry",
        "columns": {
            "geometry": {
                "crs": df.crs.to_wkt(pyproj.enums.WktVersion.WKT2_2019_SIMPLIFIED),
                "encoding": "WKB",
            }
        }
    }
}

E.g. Spark resolves this by putting extra metadata into spark-prefixed fields.

Use RFC 2119 definitions of MUST, SHOULD, etc.?

A number of specs specifically say something like:

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119. See the mbtiles spec for an example.

Should we do the same? And go through our language to make sure it matches? I'm not sure that we ever did this for STAC. But I think OGC specs usually do.

I don't feel strongly, but making an issue so we discuss and track.

Add an optional spatial index to the geoparquet format

Some geospatial formats, such as flatgeobuf, have started to include a spatial index. I think it would be great to include one as an optional feature for GeoParquet.

I consider it quite interesting because it provides a way to get spatial features (based on a bbox, for example) from a Parquet file without having to scan the whole file. It unblocks lots of use cases:

  • Analysis from distributed platforms such as Spark or Dask.
  • Tilers. You can create dynamic tilers directly from parquet files.
  • Visualization based on the viewport. A tool like QGIS might only read the geospatial data inside the user viewport rather than read the whole file.

I'm not an expert in Parquet, and it's not clear to me how to implement a geospatial index there. It looks like Parquet includes something for indexes in its metadata.


In the Thrift definition, it looks like the IndexPageHeader is under a TODO comment.

Just opening this to start the discussion on how to properly implement this.
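
One option that needs no change to the Parquet format itself (purely an illustration, not something this issue decides) is for writers to add per-feature xmin/ymin/xmax/ymax columns, so that ordinary Parquet row-group statistics act as a coarse spatial index and readers can push a bbox filter down with standard tooling:

import pyarrow.parquet as pq

# Illustrative query window.
xmin, ymin, xmax, ymax = 5.0, 50.0, 6.0, 51.0

# Assumes the writer added per-feature bounding-box columns named xmin/ymin/xmax/ymax;
# row groups whose statistics cannot overlap the window are skipped.
table = pq.read_table(
    "data.parquet",
    filters=[
        ("xmin", "<=", xmax), ("xmax", ">=", xmin),
        ("ymin", "<=", ymax), ("ymax", ">=", ymin),
    ],
)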

Proposal for a "geo-arrow" format

Various members of the Python and R geospatial communities are working on a geo-arrow-spec: a way to store geospatial data in Apache Arrow (and Apache Parquet) format. This issue is to introduce the cdw-geo and geo-arrow-spec groups, and hash out a plan for how to proceed, since there's some overlap between the two groups' goals.

In addition to intros, I wanted to address Why another parquet format?

The geoparquet format will likely store geometries using something like WKB. The geo-arrow-spec hasn't settled on a representation for geometries (see geoarrow/geoarrow#4 and geoarrow/geoarrow#3), but it will likely move away from using WKB to an "arrow native" memory layout. geoarrow/geoarrow#4 (comment) has more information on why, but the short version is that the arrow-native layout a) doesn't require decoding WKB to use the geometries, b) keeps coordinates contiguous in memory, and c) provides random access to geometries without having to parse unnecessary data.


Some logistical questions (with my recommendations):

  1. Would a version of the geo-arrow-spec be appropriate for inclusion in cdw-geo? (I think so, as long as we think having two parquet formats won't confuse people)
  2. geo-arrow-spec still needs to work out some details, including the memory layout of geometries. Where should development on geo-arrow-spec happen? (I think keep it in geo-arrow-spec, since not everyone in cdw-geo will care. Once it's more stable, we can consider moving it out of geopandas/geo-arrow-spec and into this repository, if both groups agree.)

cc @jorisvandenbossche from the geo-arrow-spec side.

bbox crossing the antimeridian

Currently the specification says:

The bbox, if specified, must be encoded with an array containing the minimum and maximum values of each dimension: [<xmin>, <ymin>, <xmax>, <ymax>]. This follows the GeoJSON specification (RFC 7946, section 5).

"Minimum" and "maximum" are not correct for cases where the bbox crosses the dateline, at least if GeoJSON is used as the basis. See GeoJSON 5.2, the following is a rough bbox for the Fiji archipelago, spanning 5 degrees of longitude: [177.0, -20.0, -178.0, -16.0].

If GeoJSON is followed, the wording should be updated/clarified. If it should be strictly minimum/maximum, then the GeoJSON reference should be removed.

Happy to create a PR for review, but it would help to understand the intention for GeoParquet. I personally would prefer the first option and be consistent with GeoJSON, which is also what is used in OGC API Features and I think STAC.

Make JSON Schema definition a core part of the specification

In #7 we decided to store metadata as a JSON-encoded blob. In #58, which adds a schema validator, a JSON schema definition is included, but not prominently featured as a core part of the GeoParquet specification.

Learning from previous specs like STAC (which includes JSON Schema definitions for each part of the spec, and allows for a wide range of tools leveraging the JSON schema), I propose that we make this JSON Schema definition a core part of the specification.

Benefits include:

  • Standardized way to describe JSON data.
    • We currently have some human-readable notes like #28 that can be succinctly and robustly described in JSON Schema.
  • Assists in making validators: JSON Schema validators exist in many different languages: https://json-schema.org/implementations.html#validators
  • Allows for a rigorous description of schema changes. A change to the spec can often be described more exactly via JSON Schema than via text in geoparquet.md

Should Z bounds be included in the bbox field?

The documentation for bbox seems to indicate that the bbox is always just [<xmin>, <ymin>, <xmax>, <ymax>]: https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#bbox. The linked RFC also allows for a 3D bounding box, but the GDAL driver doesn't write the bounding box that way (see OSGeo/gdal#5670). Is the intention to include Z in the bounding box if the Z coordinate is present?

My vote would be to not include Z in the bbox because it's not all that useful and makes it a little bit more annoying to implement.

Make sure examples for 0.1 are valid or removed

Right now our 'examples' were made by Tom, and I'm pretty sure are out of date with the tweaks we've made.

If there's code that can make a 0.1 example before we release, it'd be great to include it. But if not we should just remove the examples (perhaps we could do a 0.1.1 release with examples). If it's too much work to get an example, just remove the example folder, or assign this to me and I can do it.

Create a 'validator' to help implementors check if they are compliant

This isn't about the spec itself, but we should track progress towards tools that make it easy to check if a given file is valid GeoParquet. Ideally it'd also give warnings about any best practices - like check for bbox and explain why it's recommended.

It should help any implementor ensure that they are producing what we specified.

I have no idea of the level of effort on this, so if it's easy we can pull it forward in milestones, and it'd be good to have before we try to get lots of implementations.

How should metadata be written in a partitioned dataset?

So far the spec has only covered single-file Parquet data. However Parquet also supports saving as a "dataset", where there are several Parquet files in a folder structure. In this case, how should geospatial metadata be stored? There's a Parquet best practice that writes _common_metadata and _metadata sidecar files to the root of the folder structure, but that's not part of the actual Parquet specification.

If I understand correctly, the geo metadata would automatically be included in the _common_metadata file, and additionally statistics are stored in the _metadata file, which is relevant for #13

Should this be part of the geoparquet spec? Should it be a "best practice" that we document?
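
For reference, a sketch of the _common_metadata convention with pyarrow (an illustration of the ecosystem practice described above, not something the spec currently mandates):

import os
import pyarrow as pa
import pyarrow.parquet as pq

# Schema shared by every file of the partitioned dataset, carrying the "geo" key.
schema = pa.schema(
    [pa.field("geometry", pa.binary())],
    metadata={b"geo": b'{"version": "0.1.0", "primary_column": "geometry", '
                      b'"columns": {"geometry": {"crs": "...", "encoding": "WKB"}}}'},
)

os.makedirs("dataset", exist_ok=True)
# _common_metadata holds only the schema (and therefore the geo metadata);
# _metadata would additionally collect per-file row-group statistics.
pq.write_metadata(schema, "dataset/_common_metadata")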

geoparquet coordinate reference system default / format / extension

There are a lot of options for how we approach coordinate reference systems.

  • GeoJSON only allows 4326 data. They started with more options, but then narrowed it down.
  • Simple Features for SQL defines an 'SRID table' where you are supposed to map numeric IDs to CRS well-known text. PostGIS uses the same SRIDs as EPSG, but Oracle doesn't.
  • WKT doesn't include projection info, but that's seen as a weakness, and one main reason why EWKT came about. I believe EWKT just assumes the EPSG / PostGIS default SRID table, so it's not fully interoperable with Oracle.
  • STAC uses EPSG codes, but you can set the code to null and provide PROJJSON or CRS WKT.
  • OGC API - Features core specifies just WGS-84 (long, lat), using a URI like http://www.opengis.net/def/crs/OGC/1.3/CRS84, see crs info

And there's obviously more.

My general take is that we should have a default, and expect most things to use that, but specify it in a way that it could be extended in the future. So we shouldn't just say 'everything is 4326' in the spec, but should have a field that is always 4326 for the core spec, where in the future that field could have other values.

So I think we do the first version with just 4326, and then when people ask for more we can have an extension.

One thing I'm not sure about is whether we should use 'epsg' as the field. EPSG covers most projections people want, but not all. In GeoPackage they just create a whole SRID table to refer to, so the SRIDs used are defined. Usually the full EPSG database is included, but then users can add other options.

One potential option would be to follow OGC API - Features and use URIs. I'm not sure how widely accepted that approach is, e.g. whether the full EPSG database is already referenced online. So instead of 'epsg' as the field we'd have 'crs', and it would be a string URI.

Flesh out readme

This repo is now devoted to geoparquet. It should have a good readme to give people a clear picture of what they're getting involved in.

Consider externalizability of metadata

When [Geo]Parquet files/sources are used within systems that treat them as tables (like Spark, Trino/Presto, Athena, etc.), basic Parquet metadata is tracked in a "catalog" (e.g., a Hive-compatible catalog like AWS Glue Catalog). The engine used for querying relies on metadata to limit the parts of files (and files themselves) that are scanned, but these engines only expose the columnar content that's present, not the metadata. In some cases, metadata can be queried from the catalogs (e.g., from Athena), but the catalogs need additional work to support the metadata specified by GeoParquet (and this largely hasn't been done yet).

In the meantime, I'm curious if it makes sense to take the same metadata that's contained in the footer and externalize it into an alternate file (which could be serialized as Parquet, Avro, JSON, etc.). This would allow the query engines to register the metadata as a separate "table" (query-able as a standard source vs. requiring catalog support) and surface/take advantage of "table"-level information like CRS at query-time. At the moment, the CRS of a geometry column is something that needs to be determined out of band.

This is somewhat similar to #79, in that it doesn't look at GeoParquet sources as "files" ("tables" are often backed by many files), and could be seen as another reason to (de-)duplicate data from file footers into something that covers the whole set.

/cc @jorisvandenbossche and @kylebarron, since we talked a bit about this at FOSS4G.

bbox: Which CRS?

Is the bbox to be given in the CRS induced by GeoJSON or the CRS given in the crs field?
The description says "formatted according to RFC 7946, section 5", but doesn't say what exactly that includes/excludes.
