Giter Site home page Giter Site logo

hanielshafei / geoparquet Goto Github PK

View Code? Open in Web Editor NEW

This project forked from opengeospatial/geoparquet

0.0 0.0 0.0 186 KB

Specification for storing geospatial vector data (point, line, polygon) in Parquet

License: Apache License 2.0

Python 100.00%

geoparquet's Introduction

GeoParquet

About

This repository defines a specification for how to store geospatial vector data (point, lines, polygons) in Apache Parquet, a popular columnar storage format for tabular data - see this vendor explanation for more on what that means. Our goal is to standardize how geospatial data is represented in Parquet to further geospatial interoperability among tools using Parquet today, and hopefully help push forward what's possible with 'cloud-native geospatial' workflows. Learn more at geoparquet.org

Warning: This specification is currently in 1.0 'beta' releases, which means we don't anticipate anything major will change, but still reserve the right to make changes in backwards incompatible ways. See the versioning section below for more info. If you are excited about the potential please collaborate with us by building implementations, sounding in on the issues and contributing PR's! Note that the specification in this repository is in flux, we recommend using the latest published release.

Early contributors include developers from GeoPandas, GeoTrellis, OpenLayers, Vis.gl, Voltron Data, Microsoft, Carto, Azavea, Planet & Unfolded. Anyone is welcome to join us, by building implementations, trying it out, giving feedback through issues and contributing to the spec via pull requests. Initial work started in the geo-arrow-spec GeoPandas repository, and that will continue on Arrow work in a compatible way, with this specification focused solely on Parquet.

Goals

There are a few core goals driving the initial development.

  • Establish a great geospatial format for workflows that excel with columnar data - Most data science and 'business intelligence' workflows have been moving towards columnar data, but current geospatial formats can not be as efficiently loaded as other data. So we aim to bring geospatial data best practices to one of the most popular formats, and hopefully establish a good pattern for how to do so.
  • Introduce columnar data formats to the geospatial world - And most of the geospatial world is not yet benefitting from all the breakthroughs in data analysis in the broader IT world, so we are excited to enable interesting geospatial analysis with a wider range of tools.
  • Enable interoperability among cloud data warehouses - BigQuery, Snowflake, Redshift and others all support spatial operations but importing and exporting data with existing formats can be problematic. All support and often recommend Parquet, so defining a solid GeoParquet can help enable interoperability.
  • Persist geospatial data from Apache Arrow - GeoParquet is developed in parallel with a GeoArrow spec, to enable cross-language in-memory analytics of geospatial information with Arrow. Parquet is already well-supported by Arrow as the key on disk persistance format.

And our broader goal is to innovate with 'cloud-native vector' providing a stable base to try out new ideas for cloud-native & streaming workflows.

Features

A quick overview of what GeoParquet supports (or at least plans to support).

  • Multiple spatial reference systems - Many tools will use GeoParquet for high-performance analysis, so it's important to be able to use data in its native projection. But we do provide a clear default recommendation to better enable interoperability, giving a clear target for implementations that don't want to worry about projections.
  • Multiple geometry columns - There is a default geometry column, but additional geometry columns can be included.
  • Great compression / small files - Parquet is designed to compress very well, so data benefits by taking up less disk space & being more efficient over the network.
  • Work with both planar and spherical coordinates - Most cloud data warehouses support spherical coordinates, and so GeoParquet aims to help persist those and be clear about what is supported.
  • Great at read-heavy analytic workflows - Columnar formats enable cheap reading of a subset of columns, and Parquet in particular enables efficient filtering of chunks based on column statistics, so the format will perform well in a variety of modern analytic workflows.
  • Support for data partitioning - Parquet has a nice ability to partition data into different files for efficiency, and we aim to enable geospatial partitions.

It should be noted what GeoParquet is less good for. The biggest one is that it is not a good choice for write-heavy interactions. A row-based format will work much better if it is backing a system that is constantly updating the data and adding new data.

Roadmap

Our aim is to get to a 1.0.0 in the next few months. The next steps are to push towards our 1.0.0 acceptance criteria, articulated in this discussion post, which mostly involves getting great validation tools, more software implementations, and more data providers. This should provide key feedback to feel confident in going to 1.0. The goal of 1.0.0 is to establish a baseline of interoperability for geospatial information in Parquet. We've got some exciting plans to go beyond that, which we'll articulate soon on the website.

Versioning

After we reach version 1.0 we will follow SemVer, so at that point any breaking change will require the spec to go to 2.0.0. Currently implementors should expect breaking changes, though at some point, hopefully relatively soon (0.4?), we will declare that we don't think there will be any more potential breaking changes. Though the full commitment to that won't be made until 1.0.0.

Current Implementations & Examples

Examples of GeoParquet files following the current spec can be found in the examples/ folder. For information on all the tools and libraries implementing GeoParquet, as well as sample data, see the implementations section of the website.

geoparquet's People

Contributors

cholmes avatar jorisvandenbossche avatar tomaugspurger avatar tschaub avatar kylebarron avatar m-mohr avatar alasarr avatar rouault avatar jesus89 avatar felixpalmer avatar brendan-ward avatar cayetanobv avatar jiayuasu avatar jzb avatar yeelauren avatar evetion avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.