Comments (11)
If we want to support this, I think a simple new (and optional) field in our geo metadata where you can say which column is the ID column would be sufficient? (similarly to what GDAL does now, i.e. "fid":"fid")
I was envisioning the same. Though unless space is a concern, I think “id_column” is a bit friendlier and fits well with “primary_column” (nit).
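As a sketch of what that could look like (the "id_column" key here is just the proposal above, not part of the published spec; the other keys follow the existing geo metadata layout):

```python
import json

# Hypothetical "geo" file metadata with the proposed optional id column field.
# Only "id_column" is new; it would name the column that carries feature ids,
# analogous to GDAL's "fid" entry.
geo_metadata = {
    "version": "1.0.0",
    "primary_column": "geometry",
    "id_column": "fid",  # proposed optional field (assumption, not in spec)
    "columns": {"geometry": {"encoding": "WKB", "geometry_types": []}},
}

encoded = json.dumps(geo_metadata)
restored = json.loads(encoded)
```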
from geoparquet.
Do GDAL types not map 1:1 to Arrow types?
At 99%, but there are subtleties. For example, GDAL can have a hint for the maximum width of a string, or JSON or UUID "subtypes" for strings. Those are generally not essential metadata, but GDAL can write them for perfect round-tripping of its abstraction model.
I would personally not follow (or be inspired by) the pandas' metadata here. The information in there is very pandas specific.
If we want to support this, I think a simple new (and optional) field in our geo metadata where you can say which column is the ID column would be sufficient? (similarly to what GDAL does now, i.e. "fid":"fid")
One other thing to note is that pandas stores its metadata in two places: once in the file metadata under the pandas key, but again in the file metadata within the Arrow schema metadata. Arrow also encodes its schema in a file-level metadata key named 'ARROW:schema', whose value is a base64 encoding of a custom binary format (I forget exactly where this layout is defined, @jorisvandenbossche maybe would know?) so that it's possible to exactly round-trip Arrow data through Parquet without losing metadata such as extension types.
@kylebarron yeah, this duplication of the pandas metadata in both the normal Parquet metadata and inside the serialized schema is not super ideal. There is a JIRA about this: https://issues.apache.org/jira/browse/ARROW-14303
The format itself is the IPC message for a schema, and then base64 encoded: https://arrow.apache.org/docs/dev/cpp/parquet.html#serialization-details
But, as you mention yourself, I don't think it is interesting to look at this place to put FID information, as that is not readily available for Parquet readers that are not based on an Arrow library.
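To illustrate just the encoding layer (a minimal stdlib sketch; the payload below is a stand-in, since a real 'ARROW:schema' value decodes to a Flatbuffer-serialized IPC schema message that needs an Arrow library to parse):

```python
import base64

# Arrow IPC messages begin with a 0xFFFFFFFF continuation marker followed by a
# little-endian message length. The payload here is fake, used only to show
# that the 'ARROW:schema' value is plain base64 over those bytes.
fake_ipc_message = b"\xff\xff\xff\xff" + (12).to_bytes(4, "little") + b"schema bytes"

encoded = base64.b64encode(fake_ipc_message)
decoded = base64.b64decode(encoded)
```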
GDAL currently uses an extension gdal:schema metadata domain where it puts information such as the fid column name or GDAL-specific typing, e.g.:
{"fid":"fid","columns":{"AREA":{"type":"Real"},"EAS_ID":{"type":"Integer64"},"PRFEDEA":{"type":"String","width":16}}}
I think several libraries have come up with their own custom metadata to solve this problem; not sure there's any "native" Parquet solution. For example, Pandas adds its own metadata for its index columns (essentially equivalent to a feature id column) and data types so that it's reliably able to round-trip data.
import pandas as pd
import pyarrow.parquet as pq
import json
df = pd.DataFrame({'a': [2, 3, 4, 5], 'b': ['a', 'b', 'c', 'd']}).set_index('a')
df.to_parquet('test.parquet')
meta = pq.read_metadata('test.parquet')
json.loads(meta.metadata[b'pandas'])
{'index_columns': ['a'],
'column_indexes': [{'name': None,
'field_name': None,
'pandas_type': 'unicode',
'numpy_type': 'object',
'metadata': {'encoding': 'UTF-8'}}],
'columns': [{'name': 'b',
'field_name': 'b',
'pandas_type': 'unicode',
'numpy_type': 'object',
'metadata': None},
{'name': 'a',
'field_name': 'a',
'pandas_type': 'int64',
'numpy_type': 'int64',
'metadata': None}],
'creator': {'library': 'pyarrow', 'version': '7.0.0'},
'pandas_version': '1.3.1'}
Interesting. Seems like it might be good for GeoParquet to at least make a recommendation for reliable FID. Perhaps just use what Pandas does, and then GeoPandas and GDAL could align?
I'll note that pandas uses a list of fields for index_columns, to support its MultiIndex (I think similar to a composite key in some flavors of SQL). It looks like GDAL uses a string, at least in Even's example.
It looks like GDAL uses a string, at least in Even's example.
Yes, GDAL only supports a single numeric column as the feature identifier.
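The two shapes could be reconciled by always normalizing to a list on read. A sketch (the helper name is hypothetical; the "fid" and "index_columns" keys are the GDAL and pandas forms discussed above):

```python
def id_columns(meta):
    """Normalize an id-column declaration to a list of column names.

    GDAL-style metadata uses a single string ("fid": "fid"), while
    pandas-style "index_columns" is a list that may name several columns
    (a MultiIndex, i.e. a composite key).
    """
    value = meta.get("fid") or meta.get("index_columns") or []
    return [value] if isinstance(value, str) else list(value)

id_columns({"fid": "fid"})                  # GDAL form: a single column
id_columns({"index_columns": ["a", "b"]})   # pandas form: possibly composite
```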
Perhaps just use what Pandas does, and then GeoPandas and GDAL could align?
I'd be hesitant to mimic the Pandas metadata exactly because it's very Python-specific, at least the pandas_type and numpy_type fields.
One other thing to note is that pandas stores its metadata in two places: once in the file metadata under the pandas key, but again in the file metadata within the Arrow schema metadata. Arrow also encodes its schema in a file-level metadata key named 'ARROW:schema', whose value is a base64 encoding of a custom binary format (I forget exactly where this layout is defined, @jorisvandenbossche maybe would know?) so that it's possible to exactly round-trip Arrow data through Parquet without losing metadata such as extension types.
# continuing from above
meta.metadata[b'ARROW:schema']
# b'/////6gCAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABBAAQAAAAAAAKAAwAAAAEAAgACgAAAAQCAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAADcAQAABAAAAM0BAAB7ImluZGV4X2NvbHVtbnMiOiBbImEiXSwgImNvbHVtbl9pbmRleGVzIjogW3sibmFtZSI6IG51bGwsICJmaWVsZF9uYW1lIjogbnVsbCwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiB7ImVuY29kaW5nIjogIlVURi04In19XSwgImNvbHVtbnMiOiBbeyJuYW1lIjogImIiLCAiZmllbGRfbmFtZSI6ICJiIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogImEiLCAiZmllbGRfbmFtZSI6ICJhIiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfV0sICJjcmVhdG9yIjogeyJsaWJyYXJ5IjogInB5YXJyb3ciLCAidmVyc2lvbiI6ICI3LjAuMCJ9LCAicGFuZGFzX3ZlcnNpb24iOiAiMS4zLjEifQAAAAYAAABwYW5kYXMAAAIAAABMAAAABAAAAMz///8AAAECEAAAABwAAAAEAAAAAAAAAAEAAABhAAAACAAMAAgABwAIAAAAAAAAAUAAAAAQABQACAAGAAcADAAAABAAEAAAAAAAAQUQAAAAGAAAAAQAAAAAAAAAAQAAAGIAAAAEAAQABAAAAA=='
# pyarrow decodes and parses that buffer when reading the schema
arrow_schema = pq.read_schema('test.parquet')
print(arrow_schema)
# b: string
# a: int64
# -- schema metadata --
# pandas: '{"index_columns": ["a"], "column_indexes": [{"name": null, "fiel' ...
print(arrow_schema.metadata)
# {b'pandas': b'{"index_columns": ["a"], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "b", "field_name": "b", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "a", "field_name": "a", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}], "creator": {"library": "pyarrow", "version": "7.0.0"}, "pandas_version": "1.3.1"}'}
Having the metadata in the Arrow schema might make it more interoperable than the pandas-specific metadata, but it's not usable for non-Arrow-based readers (the Java world, I think). If we want to be able to round-trip from e.g. GeoJSON, which has feature identifiers, maybe it would make sense to add an option to the geoparquet-specific metadata describing an id column (though we'd have to guard against our metadata being out of sync with other metadata).
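A reader-side guard for that last concern could be as simple as checking the declared id column against the file's actual columns. A sketch, assuming the hypothetical "id_column" key discussed above:

```python
def resolve_id_column(geo_metadata, schema_columns):
    """Return the declared id column, or None if absent or out of sync.

    The "id_column" key is the hypothetical geoparquet metadata field
    discussed above, not part of the spec.
    """
    id_col = geo_metadata.get("id_column")
    if id_col is not None and id_col not in schema_columns:
        # Metadata disagrees with the file's actual columns: ignore it
        # rather than hand back a dangling reference.
        return None
    return id_col

resolve_id_column({"id_column": "fid"}, ["fid", "geometry"])  # declared and present
resolve_id_column({"id_column": "fid"}, ["geometry"])         # stale declaration
```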
GDAL currently uses an extension gdal:schema metadata domain where it puts information such as the fid column name or GDAL specific typing. e.g:
Other than the fid, is there a reason why GDAL can't reuse the Arrow schema metadata, given that GDAL is using the Arrow C++ libraries to read/write Parquet? Do GDAL types not map 1:1 to Arrow types?
The 10/24 call says we should add a 'best practice' note that Parquet doesn't have a primary key, so it's not part of this spec. GDAL should do what it does; if other systems are also interested in round-tripping a feature id, then we'd consider it as part of the spec.