Has there been discussion around including an id_column

Feature identifiers about geoparquet HOT 11 CLOSED

tschaub commented on May 20, 2024

Feature identifiers

from geoparquet.

Comments (11)

tschaub commented on May 20, 2024

If we want to support this, I think a simple new (and optional) field in our geo metadata where you can say which column is the ID column would be sufficient? (similarly to what GDAL does now, i.e. "fid":"fid")

I was envisioning the same. Though unless space is a concern, I think “id_column” is a bit friendlier and fits well with “primary_column” (nit).
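To make the suggestion concrete, here is a minimal sketch of what such an optional field could look like next to "primary_column", and how a reader might pick it up. The "id_column" key is only the name proposed in this thread and is not part of the GeoParquet spec; the other keys follow the 1.0.0 metadata layout, and the column names are made up.

```python
import json

# Sketch of a "geo" metadata value with the proposed optional field.
geo_metadata_json = json.dumps({
    "version": "1.0.0",
    "primary_column": "geometry",
    "id_column": "fid",  # hypothetical key, as proposed in this thread
    "columns": {
        "geometry": {"encoding": "WKB", "geometry_types": []},
    },
})

geo = json.loads(geo_metadata_json)

# The field is optional, so readers should tolerate its absence.
id_column = geo.get("id_column")
primary_column = geo["primary_column"]

print(primary_column, id_column)
```

Because the field would be optional, files written without it stay valid and readers simply get None.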


rouault commented on May 20, 2024

Do GDAL types not map 1:1 to Arrow types?

At 99%, but there are subtleties. For example, GDAL can have a hint for the maximum width of a string, or JSON or UUID "subtypes" for strings. Those are generally not essential metadata, but GDAL can write them for perfect round-tripping of its abstraction model.


jorisvandenbossche commented on May 20, 2024

I would personally not follow (or be inspired by) the pandas' metadata here. The information in there is very pandas specific.

If we want to support this, I think a simple new (and optional) field in our geo metadata where you can say which column is the ID column would be sufficient? (similarly to what GDAL does now, i.e. "fid":"fid")

One other thing to note is that pandas stores its metadata in two places: once in the file metadata under the pandas key, but again in the file metadata within the Arrow schema metadata. Arrow also encodes its schema in a file-level metadata key named 'ARROW:schema', which is a base64 encoding of a custom binary format (I forget exactly where this layout is defined, @jorisvandenbossche maybe would know?) so that it's possible to exactly round trip Arrow data through Parquet without losing metadata like for extension types.

@kylebarron yeah, this duplication of the pandas metadata in both the normal Parquet metadata and inside the serialized schema is not super ideal. There is a JIRA about this: https://issues.apache.org/jira/browse/ARROW-14303
The format itself is the IPC message for a schema, and then base64 encoded: https://arrow.apache.org/docs/dev/cpp/parquet.html#serialization-details
But, as you mention yourself, I don't think it is interesting to look at this place to put FID information, as that is not readily available for Parquet readers that are not based on an Arrow library.


rouault commented on May 20, 2024

GDAL currently uses an extension gdal:schema metadata domain where it puts information such as the fid column name or GDAL specific typing.
e.g:

{"fid":"fid","columns":{"AREA":{"type":"Real"},"EAS_ID":{"type":"Integer64"},"PRFEDEA":{"type":"String","width":16}}}
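As a rough illustration of how a reader could use this, the snippet below parses the example value above with Python's standard library and pulls out the FID column name and the per-column GDAL hints. It assumes the JSON string has already been read from the file's key-value metadata; the variable names are made up for the sketch and this is not GDAL's own code.

```python
import json

# The example value of the gdal:schema metadata entry shown above.
gdal_schema_json = (
    '{"fid":"fid","columns":{"AREA":{"type":"Real"},'
    '"EAS_ID":{"type":"Integer64"},'
    '"PRFEDEA":{"type":"String","width":16}}}'
)

gdal_schema = json.loads(gdal_schema_json)

# Name of the feature id column; None if the writer did not record one.
fid_column = gdal_schema.get("fid")

# GDAL-specific hints (type, optional width) keyed by column name.
column_hints = gdal_schema.get("columns", {})

print(fid_column)
for name, hint in column_hints.items():
    # "width" is only present for some columns, so use .get().
    print(name, hint["type"], hint.get("width"))
```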


kylebarron commented on May 20, 2024

I think several libraries have come up with their own custom metadata to solve this problem; not sure there's any "native" Parquet solution. For example, Pandas adds its own metadata for its index columns (essentially equivalent to a feature id column) and data types so that it's reliably able to round-trip data.

import pandas as pd
import pyarrow.parquet as pq
import json

df = pd.DataFrame({'a': [2, 3, 4, 5], 'b': ['a', 'b', 'c', 'd']}).set_index('a')
df.to_parquet('test.parquet')

meta = pq.read_metadata('test.parquet')
json.loads(meta.metadata[b'pandas'])

{'index_columns': ['a'],
 'column_indexes': [{'name': None,
   'field_name': None,
   'pandas_type': 'unicode',
   'numpy_type': 'object',
   'metadata': {'encoding': 'UTF-8'}}],
 'columns': [{'name': 'b',
   'field_name': 'b',
   'pandas_type': 'unicode',
   'numpy_type': 'object',
   'metadata': None},
  {'name': 'a',
   'field_name': 'a',
   'pandas_type': 'int64',
   'numpy_type': 'int64',
   'metadata': None}],
 'creator': {'library': 'pyarrow', 'version': '7.0.0'},
 'pandas_version': '1.3.1'}


cholmes commented on May 20, 2024

Interesting. Seems like it might be good for GeoParquet to at least make a recommendation for reliable FID. Perhaps just use what Pandas does, and then GeoPandas and GDAL could align?


TomAugspurger commented on May 20, 2024

I'll note that pandas uses a list of fields for index_columns, to support its MultiIndex (I think similar to a composite key in some flavors of SQL). It looks like GDAL uses a string, at least in Even's example.


rouault commented on May 20, 2024

It looks like GDAL uses a string, at least in Even's example.

yes, GDAL only supports a single numeric column as feature identifier
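A reader that wants to handle both conventions could normalize them to one shape first. The helper below is a hypothetical sketch (the function name is not from any library): it turns either a single name, as in GDAL's "fid", or a list of names, as in pandas' index_columns, into a list. It assumes every entry is a plain column name.

```python
from typing import List, Union


def normalize_id_columns(value: Union[None, str, List[str]]) -> List[str]:
    """Return identifier column names as a list, whatever the source shape."""
    if value is None:
        return []          # no identifier recorded
    if isinstance(value, str):
        return [value]     # GDAL style: a single column name
    return list(value)     # pandas style: a list, possibly several names


gdal_style = normalize_id_columns("fid")
pandas_style = normalize_id_columns(["a"])
missing = normalize_id_columns(None)

print(gdal_style, pandas_style, missing)
```

A consumer that, like GDAL, only supports a single identifier could then reject or ignore any result with more than one entry.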


kylebarron commented on May 20, 2024

Perhaps just use what Pandas does, and then GeoPandas and GDAL could align?

I'd be hesitant to mimic the Pandas metadata exactly because it's very Python-specific, at least the pandas_type and numpy_type.

One other thing to note is that pandas stores its metadata in two places: once in the file metadata under the pandas key, but again in the file metadata within the Arrow schema metadata. Arrow also encodes its schema in a file-level metadata key named 'ARROW:schema', which is a base64 encoding of a custom binary format (I forget exactly where this layout is defined, @jorisvandenbossche maybe would know?) so that it's possible to exactly round trip Arrow data through Parquet without losing metadata like for extension types.

# continuing from above
meta.metadata[b'ARROW:schema']
# b'/////6gCAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABBAAQAAAAAAAKAAwAAAAEAAgACgAAAAQCAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAADcAQAABAAAAM0BAAB7ImluZGV4X2NvbHVtbnMiOiBbImEiXSwgImNvbHVtbl9pbmRleGVzIjogW3sibmFtZSI6IG51bGwsICJmaWVsZF9uYW1lIjogbnVsbCwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiB7ImVuY29kaW5nIjogIlVURi04In19XSwgImNvbHVtbnMiOiBbeyJuYW1lIjogImIiLCAiZmllbGRfbmFtZSI6ICJiIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogImEiLCAiZmllbGRfbmFtZSI6ICJhIiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfV0sICJjcmVhdG9yIjogeyJsaWJyYXJ5IjogInB5YXJyb3ciLCAidmVyc2lvbiI6ICI3LjAuMCJ9LCAicGFuZGFzX3ZlcnNpb24iOiAiMS4zLjEifQAAAAYAAABwYW5kYXMAAAIAAABMAAAABAAAAMz///8AAAECEAAAABwAAAAEAAAAAAAAAAEAAABhAAAACAAMAAgABwAIAAAAAAAAAUAAAAAQABQACAAGAAcADAAAABAAEAAAAAAAAQUQAAAAGAAAAAQAAAAAAAAAAQAAAGIAAAAEAAQABAAAAA=='

# Function to read and parse the above buffer
arrow_schema = pq.read_schema('test.parquet')
print(arrow_schema)
# b: string
# a: int64
# -- schema metadata --
# pandas: '{"index_columns": ["a"], "column_indexes": [{"name": null, "fiel' ...

print(arrow_schema.metadata)
# {b'pandas': b'{"index_columns": ["a"], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "b", "field_name": "b", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "a", "field_name": "a", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}], "creator": {"library": "pyarrow", "version": "7.0.0"}, "pandas_version": "1.3.1"}'}

Having metadata in the arrow schema might make it more interoperable than in the pandas-specific metadata, but not usable for non-Arrow-based readers (the Java world I think). If we want to be able to round-trip from e.g. GeoJSON which has feature identifiers, maybe it would make sense to add an option to the geoparquet-specific metadata describing an id column (though we'd have to guard against our metadata being out of sync with other metadata)
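To sketch what guarding against out-of-sync metadata might involve, the function below compares a hypothetical "id_column" entry in the geo metadata with the column list, with GDAL's "fid", and with pandas' index_columns. The function name, the "id_column" key, and the idea of returning a list of messages are all assumptions for illustration; the inputs are expected to be already-parsed dicts.

```python
from typing import List, Optional


def find_id_conflicts(
    geo: dict,
    column_names: List[str],
    gdal_schema: Optional[dict] = None,
    pandas_meta: Optional[dict] = None,
) -> List[str]:
    """Return human-readable inconsistencies (empty list if none)."""
    problems: List[str] = []
    id_column = geo.get("id_column")  # hypothetical key, not in the spec
    if id_column is None:
        return problems  # nothing declared, nothing to check

    if id_column not in column_names:
        problems.append(f"id_column {id_column!r} is not a column in the file")

    if gdal_schema is not None:
        fid = gdal_schema.get("fid")
        if fid is not None and fid != id_column:
            problems.append(f"gdal:schema fid {fid!r} differs from id_column")

    if pandas_meta is not None:
        index_columns = pandas_meta.get("index_columns") or []
        if index_columns and index_columns != [id_column]:
            problems.append(
                f"pandas index_columns {index_columns!r} differs from id_column"
            )
    return problems


conflicts = find_id_conflicts(
    geo={"primary_column": "geometry", "id_column": "fid"},
    column_names=["fid", "geometry", "AREA"],
    gdal_schema={"fid": "fid"},
    pandas_meta={"index_columns": ["a"]},
)
print(conflicts)
```

Whether a mismatch should be an error or just a warning is a design choice the spec would have to make; the sketch only reports it.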


kylebarron commented on May 20, 2024

GDAL currently uses an extension gdal:schema metadata domain where it puts information such as the fid column name or GDAL specific typing. e.g:

Other than the fid, is there a reason why GDAL can't reuse the Arrow schema metadata, given that GDAL is using the Arrow C++ libraries to read/write Parquet? Do GDAL types not map 1:1 to Arrow types?


cholmes commented on May 20, 2024

Call 10/24 says we should add a 'best practice' note saying that Parquet doesn't have a primary key, so it's not part of this spec. GDAL should keep doing what it does; if other systems are also interested in round-tripping a feature id, then we'd consider it as part of the spec.
