Feature identifiers (geoparquet issue, closed)

tschaub commented on May 20, 2024
Feature identifiers

from geoparquet.

Comments (11)

tschaub commented on May 20, 2024

> If we want to support this, I think a simple new (and optional) field in our geo metadata where you can say which column is the ID column would be sufficient? (similarly to what GDAL does now, i.e. "fid":"fid")

I was envisioning the same. Though unless space is a concern, I think “id_column” is a bit friendlier and fits well with “primary_column” (nit).
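
As a rough illustration of the optional field discussed here, the sketch below builds a trimmed-down "geo" metadata dictionary with a hypothetical "id_column" key next to "primary_column". The key name follows the suggestion above and is not part of any published GeoParquet version; the column names and the helper validate_id_column are made up for the example.

```python
import json

# Column names of an imaginary table (assumption for this example).
table_columns = ["fid", "name", "geometry"]

# "id_column" is the hypothetical optional key proposed in this thread.
# Other keys a real "geo" object would carry are omitted for brevity.
geo_metadata = {
    "primary_column": "geometry",
    "id_column": "fid",
    "columns": {"geometry": {"encoding": "WKB"}},
}


def validate_id_column(geo, columns):
    """Return the id column name, or None when the optional key is absent."""
    id_column = geo.get("id_column")
    if id_column is None:
        return None
    if id_column not in columns:
        raise ValueError(f"id_column {id_column!r} is not a column of the table")
    if id_column == geo.get("primary_column"):
        raise ValueError("id_column must differ from the geometry column")
    return id_column


# The string that would be stored under the file-level "geo" key.
serialized = json.dumps(geo_metadata)
print(validate_id_column(json.loads(serialized), table_columns))
```

Because the key is optional, a reader that does not know about it can ignore it, and a file without it simply yields None here.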

rouault commented on May 20, 2024

> Do GDAL types not map 1:1 to Arrow types?

At 99%, but there are subtleties. For example, GDAL can carry a hint for the maximum width of a string, or JSON or UUID "subtypes" for strings. Those are generally not essential metadata, but GDAL can write them for perfect round-tripping of its abstraction model.
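
To make the "99%" concrete, here is a small sketch that splits a GDAL-style column description into a portable type name and the GDAL-only hints that a plain Arrow type cannot express. The mapping table (Real to double, Integer64 to int64, String to string) is an assumption about the obvious correspondence, not something taken from GDAL itself, and split_gdal_column is a hypothetical helper.

```python
# Assumed correspondence between GDAL field types and Arrow type names.
GDAL_TO_ARROW = {"Real": "double", "Integer64": "int64", "String": "string"}


def split_gdal_column(description):
    """Split a GDAL column description into (arrow_type, gdal_only_hints)."""
    arrow_type = GDAL_TO_ARROW[description["type"]]
    # Everything besides "type" (for example "width") is a hint that the
    # plain Arrow type does not carry.
    hints = {key: value for key, value in description.items() if key != "type"}
    return arrow_type, hints


print(split_gdal_column({"type": "String", "width": 16}))
print(split_gdal_column({"type": "Real"}))
```

The non-empty hints dictionary is the part that would be lost without GDAL's own metadata.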

jorisvandenbossche commented on May 20, 2024

I would personally not follow (or be inspired by) the pandas metadata here. The information in there is very pandas-specific.

If we want to support this, I think a simple new (and optional) field in our geo metadata where you can say which column is the ID column would be sufficient? (similarly to what GDAL does now, i.e. "fid":"fid")

> One other thing to note is that pandas stores its metadata in two places: once in the file metadata under the pandas key, and again within the Arrow schema metadata, which is itself stored in the file metadata. Arrow encodes its schema in a file-level metadata key named 'ARROW:schema', whose value is a base64 encoding of a custom binary format (I forget exactly where this layout is defined, @jorisvandenbossche maybe would know?), so that it's possible to exactly round-trip Arrow data through Parquet without losing metadata such as that for extension types.

@kylebarron yeah, this duplication of the pandas metadata in both the normal Parquet metadata and inside the serialized schema is not super ideal. There is a JIRA about this: https://issues.apache.org/jira/browse/ARROW-14303
The format itself is the IPC message for a schema, and then base64 encoded: https://arrow.apache.org/docs/dev/cpp/parquet.html#serialization-details
But, as you mention yourself, I don't think it is interesting to look at this place to put FID information, as that is not readily available for Parquet readers that are not based on an Arrow library.
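
The "IPC message, then base64" layout can be checked without any Arrow library. This sketch decodes only the first 16 base64 characters of the 'ARROW:schema' value printed in this thread and reads the two header fields of the encapsulated IPC message. It relies on the documented Arrow IPC framing (continuation marker, then a little-endian metadata size); parsing the Flatbuffers payload that follows is left to a real reader such as pq.read_schema.

```python
import base64
import struct

# First 16 base64 characters of the 'ARROW:schema' value shown in this thread.
prefix_b64 = "/////6gCAAAQAAAA"
raw = base64.b64decode(prefix_b64)

# An encapsulated IPC message starts with a 0xFFFFFFFF continuation marker,
# followed by a little-endian int32 giving the size of the Flatbuffers metadata.
continuation, metadata_size = struct.unpack("<Ii", raw[:8])

print(hex(continuation))
print(metadata_size)
```

A reader that is not built on Arrow could get this far with the standard library, but it would still need a Flatbuffers parser for the schema itself, which supports the point that this is not a convenient place for FID information.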

rouault commented on May 20, 2024

GDAL currently uses an extension gdal:schema metadata domain where it puts information such as the fid column name or GDAL-specific typing, e.g.:

{"fid":"fid","columns":{"AREA":{"type":"Real"},"EAS_ID":{"type":"Integer64"},"PRFEDEA":{"type":"String","width":16}}}
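
For readers who want to consume this value, a minimal sketch using only the standard library follows. It parses the example string exactly as shown above; how the raw string is obtained from a file is left out, and treating a missing "fid" key as "no FID column" is an assumption.

```python
import json

# The example value from the comment above.
gdal_schema = json.loads(
    '{"fid":"fid","columns":{"AREA":{"type":"Real"},'
    '"EAS_ID":{"type":"Integer64"},'
    '"PRFEDEA":{"type":"String","width":16}}}'
)

# Assumption: the key may be absent when no FID column was written.
fid_column = gdal_schema.get("fid")

# Map each column name to its GDAL type name.
column_types = {name: spec["type"] for name, spec in gdal_schema["columns"].items()}

print(fid_column)
print(column_types)
```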

kylebarron commented on May 20, 2024

I think several libraries have come up with their own custom metadata to solve this problem; not sure there's any "native" Parquet solution. For example, Pandas adds its own metadata for its index columns (essentially equivalent to a feature id column) and data types so that it's reliably able to round-trip data.

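import pandas as pd
import pyarrow.parquet as pq
import json

# Build a small frame and make column 'a' the index before writing
df = pd.DataFrame({'a': [2, 3, 4, 5], 'b': ['a', 'b', 'c', 'd']}).set_index('a')
df.to_parquet('test.parquet')

# Read the file-level key-value metadata and decode the JSON stored under the 'pandas' key
meta = pq.read_metadata('test.parquet')
json.loads(meta.metadata[b'pandas'])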
{'index_columns': ['a'],
 'column_indexes': [{'name': None,
   'field_name': None,
   'pandas_type': 'unicode',
   'numpy_type': 'object',
   'metadata': {'encoding': 'UTF-8'}}],
 'columns': [{'name': 'b',
   'field_name': 'b',
   'pandas_type': 'unicode',
   'numpy_type': 'object',
   'metadata': None},
  {'name': 'a',
   'field_name': 'a',
   'pandas_type': 'int64',
   'numpy_type': 'int64',
   'metadata': None}],
 'creator': {'library': 'pyarrow', 'version': '7.0.0'},
 'pandas_version': '1.3.1'}

cholmes commented on May 20, 2024

Interesting. Seems like it might be good for GeoParquet to at least make a recommendation for reliable FID. Perhaps just use what Pandas does, and then GeoPandas and GDAL could align?

TomAugspurger commented on May 20, 2024

I'll note that pandas uses a list of fields for index_columns, to support its MultiIndex (I think similar to a composite key in some flavors of SQL). It looks like GDAL uses a string, at least in Even's example.
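
The list-versus-string difference is easy to paper over in a reader. The sketch below normalizes both shapes to a list of column names; id_columns is a hypothetical helper, and the inputs are the already-parsed JSON objects from the pandas and gdal:schema metadata shown in this thread.

```python
def id_columns(pandas_meta=None, gdal_schema=None):
    """Return identifier column names as a list, whichever metadata is present."""
    if pandas_meta is not None:
        # pandas: a list, possibly with several entries for a MultiIndex.
        # Non-string entries are skipped (assumption: they do not name a
        # column stored in the file).
        return [c for c in pandas_meta.get("index_columns", []) if isinstance(c, str)]
    if gdal_schema is not None and "fid" in gdal_schema:
        # GDAL: a single column name.
        return [gdal_schema["fid"]]
    return []


print(id_columns(pandas_meta={"index_columns": ["a"]}))
print(id_columns(gdal_schema={"fid": "fid", "columns": {}}))
```

A spec that only allows one identifier column would have to reject, or pick from, a pandas list with more than one entry.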

rouault commented on May 20, 2024

> It looks like GDAL uses a string, at least in Even's example.

Yes, GDAL only supports a single numeric column as the feature identifier.

kylebarron commented on May 20, 2024

> Perhaps just use what Pandas does, and then GeoPandas and GDAL could align?

I'd be hesitant to mimic the Pandas metadata exactly because it's very Python-specific, at least the pandas_type and numpy_type.

One other thing to note is that pandas stores its metadata in two places: once in the file metadata under the pandas key, and again within the Arrow schema metadata, which is itself stored in the file metadata. Arrow encodes its schema in a file-level metadata key named 'ARROW:schema', whose value is a base64 encoding of a custom binary format (I forget exactly where this layout is defined, @jorisvandenbossche maybe would know?), so that it's possible to exactly round-trip Arrow data through Parquet without losing metadata such as that for extension types.

# continuing from above
meta.metadata[b'ARROW:schema']
# b'/////6gCAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABBAAQAAAAAAAKAAwAAAAEAAgACgAAAAQCAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAADcAQAABAAAAM0BAAB7ImluZGV4X2NvbHVtbnMiOiBbImEiXSwgImNvbHVtbl9pbmRleGVzIjogW3sibmFtZSI6IG51bGwsICJmaWVsZF9uYW1lIjogbnVsbCwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiB7ImVuY29kaW5nIjogIlVURi04In19XSwgImNvbHVtbnMiOiBbeyJuYW1lIjogImIiLCAiZmllbGRfbmFtZSI6ICJiIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogImEiLCAiZmllbGRfbmFtZSI6ICJhIiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfV0sICJjcmVhdG9yIjogeyJsaWJyYXJ5IjogInB5YXJyb3ciLCAidmVyc2lvbiI6ICI3LjAuMCJ9LCAicGFuZGFzX3ZlcnNpb24iOiAiMS4zLjEifQAAAAYAAABwYW5kYXMAAAIAAABMAAAABAAAAMz///8AAAECEAAAABwAAAAEAAAAAAAAAAEAAABhAAAACAAMAAgABwAIAAAAAAAAAUAAAAAQABQACAAGAAcADAAAABAAEAAAAAAAAQUQAAAAGAAAAAQAAAAAAAAAAQAAAGIAAAAEAAQABAAAAA=='

# Function to read and parse the above buffer
arrow_schema = pq.read_schema('test.parquet')
print(arrow_schema)
# b: string
# a: int64
# -- schema metadata --
# pandas: '{"index_columns": ["a"], "column_indexes": [{"name": null, "fiel' ...

print(arrow_schema.metadata)
# {b'pandas': b'{"index_columns": ["a"], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "b", "field_name": "b", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "a", "field_name": "a", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}], "creator": {"library": "pyarrow", "version": "7.0.0"}, "pandas_version": "1.3.1"}'}

Having metadata in the arrow schema might make it more interoperable than in the pandas-specific metadata, but not usable for non-Arrow-based readers (the Java world I think). If we want to be able to round-trip from e.g. GeoJSON which has feature identifiers, maybe it would make sense to add an option to the geoparquet-specific metadata describing an id column (though we'd have to guard against our metadata being out of sync with other metadata)
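
The out-of-sync concern can be sketched as a simple consistency check. Everything here is hypothetical: the geoparquet id column is passed in as a plain string because no such key exists in the spec, and check_id_in_sync is an illustrative helper that compares it against the parsed pandas and gdal:schema metadata shown earlier in the thread.

```python
def check_id_in_sync(geo_id_column, pandas_meta=None, gdal_schema=None):
    """Return a list of human-readable conflicts; an empty list means consistent."""
    conflicts = []
    if pandas_meta is not None:
        index_columns = pandas_meta.get("index_columns", [])
        # Only a single-entry index naming the same column counts as agreeing.
        if index_columns and index_columns != [geo_id_column]:
            conflicts.append(f"pandas index_columns is {index_columns!r}")
    if gdal_schema is not None:
        fid = gdal_schema.get("fid")
        if fid is not None and fid != geo_id_column:
            conflicts.append(f"gdal:schema fid is {fid!r}")
    return conflicts


print(check_id_in_sync("a", pandas_meta={"index_columns": ["a"]}))
print(check_id_in_sync("a", gdal_schema={"fid": "fid"}))
```

Whether a writer should refuse to write, warn, or silently prefer one source when conflicts are found is a policy question this sketch does not answer.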

kylebarron commented on May 20, 2024

> GDAL currently uses an extension gdal:schema metadata domain where it puts information such as the fid column name or GDAL specific typing. e.g:

Other than the fid, is there a reason why GDAL can't reuse the Arrow schema metadata, given that GDAL is using the Arrow C++ libraries to read/write Parquet? Do GDAL types not map 1:1 to Arrow types?

cholmes commented on May 20, 2024

The 10/24 call concluded that we should add a 'best practice' note saying that Parquet doesn't have a primary key, so it's not part of this spec. GDAL should keep doing what it does; if other systems are also interested in round-tripping a feature id, then we'd consider it as part of the spec.
