
geoarrow-r's Introduction

geoarrow

The goal of geoarrow is to leverage the features of the arrow package and the larger Apache Arrow ecosystem for geospatial data. The geoarrow package provides an R implementation of the GeoParquet file format and the draft geoarrow data specification, defining extension array types for vector geospatial data.

Installation

You can install the development version of geoarrow from GitHub with:

# install.packages("pak")
pak::pak("geoarrow/geoarrow-r")

Example

This is a basic example which shows you how to solve a common problem:

library(geoarrow)

geoarrow-r's People

Contributors

paleolimbot, mikemahoney218

geoarrow-r's Issues

as_geoarrow_array() method for sf objects

I expected that as_geoarrow_array() would work for anything for which wk::is_handleable() returns TRUE. However, as_geoarrow_array() fails on sf objects while succeeding on s3 objects.

`st_collect()`, `st_as_sf()`, and default conversion from Arrow to R

Right now, geoarrow doesn't convert to sf by default and instead maintains a zero-copy shell around the ChunkedArray from whence it came. This is instantaneous and is kind of like ALTREP for geometry, since we can't do ALTREP on lists like Arrow does for character, integer, and factor. This is up to 10x faster and prevents a full copy of the geometry column. I also rather like that it maintains neutrality between terra, sf, vapour, wk, or others that may come along in the future...who are we to guess where the user wants to put the geometry column next? The destination could be Arrow itself (e.g., via group_by() %>% write_dataset()), or the column could get dropped, filtered, or rearranged before calling an sf method.

However, 99% of the time a user just wants an sf object. After #20 we can use sf::st_as_sf() on an arrow_dplyr_query to collect() it into an sf object, and @boshek suggested st_collect(), which is a much better name and more explicit than st_as_sf(). There are also st_geometry(), st_crs(), st_bbox(), and st_as_crs() methods for the geoarrow_vctr column; however, we still get an awkward error if we collect() and then try to convert to sf:

vctr <- geoarrow::geoarrow(wk::wkt("POINT (0 1)", crs = "OGC:CRS84"))
df <- data.frame(geometry = vctr)
sf::st_as_sf(df)
#> Error in st_sf(x, ..., agr = agr, sf_column_name = sf_column_name): no simple features geometry column present

That might be solvable in sf, although I'd like to give the current implementation a chance to be tested and to collect feedback on whether this is a problem for anybody before committing to zero-copy-shell-by-default.

geoarrow still drops CRS

Hello, I see that #16 is marked as closed. However I still encounter this issue on a newly installed version of the package on Windows (R 4.2.2).

library(geodata)
library(sf)
library(geoarrow)

# Fetch data and convert to sf
countries <- world(resolution = 2, level = 0, path = ".")
countries <- st_as_sf(countries)

# Write and read the output
write_geoparquet(countries, "countries.parquet")
countries_reload <- read_geoparquet("countries.parquet")

# There is no crs in the reloaded file
st_crs(countries_reload)
# Coordinate Reference System: NA

# There was one in the initial object
st_crs(countries)
# Coordinate Reference System:
#   User input: GEOGCRS["unknown",
#     DATUM["World Geodetic System 1984",
#         ELLIPSOID["WGS 84",6378137,298.257223563,
#             LENGTHUNIT["metre",1]],
#         ID["EPSG",6326]],
#     PRIMEM["Greenwich",0,
#         ANGLEUNIT["degree",0.0174532925199433],
#         ID["EPSG",8901]],
#     CS[ellipsoidal,2],
#         AXIS["longitude",east,
#             ORDER[1],
#             ANGLEUNIT["degree",0.0174532925199433,
#                 ID["EPSG",9122]]],
#         AXIS["latitude",north,
#             ORDER[2],
#             ANGLEUNIT["degree",0.0174532925199433,
#                 ID["EPSG",9122]]]] 
#   wkt:
# GEOGCRS["unknown",
#     DATUM["World Geodetic System 1984",
#         ELLIPSOID["WGS 84",6378137,298.257223563,
#             LENGTHUNIT["metre",1]],
#         ID["EPSG",6326]],
#     PRIMEM["Greenwich",0,
#         ANGLEUNIT["degree",0.0174532925199433],
#         ID["EPSG",8901]],
#     CS[ellipsoidal,2],
#         AXIS["longitude",east,
#             ORDER[1],
#             ANGLEUNIT["degree",0.0174532925199433,
#                 ID["EPSG",9122]]],
#         AXIS["latitude",north,
#             ORDER[2],
#             ANGLEUNIT["degree",0.0174532925199433,
#                 ID["EPSG",9122]]]] 

WKB with non-2D dimensions doesn't follow ISO encoding

I noticed that WKB geometries with XYZ, XYM, or XYZM coordinate dimensions use bits 30 and 31 (the two most significant bits) of the int32 geometry type field at offset 1 of the WKB, instead of the 2D_code+1000, 2D_code+2000, and 2D_code+3000 type codes used by ISO WKB.
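To make the two conventions concrete, here is a hypothetical Python helper (not part of geoarrow; the function name and structure are illustrative) that classifies the coordinate dimension of a WKB header under either the high-bit (EWKB-style) or the ISO +1000/+2000/+3000 encoding:

```python
import struct

def wkb_dimensions(buf: bytes) -> str:
    """Classify the coordinate dimension of a WKB geometry header.

    Handles both ISO WKB (geometry type code + 1000 for Z, + 2000 for M,
    + 3000 for ZM) and the EWKB-style convention that sets the two most
    significant bits of the 32-bit type field (0x80000000 = Z,
    0x40000000 = M).
    """
    byte_order = buf[0]  # 1 = little-endian, 0 = big-endian
    fmt = "<I" if byte_order == 1 else ">I"
    (type_code,) = struct.unpack_from(fmt, buf, 1)

    has_z = bool(type_code & 0x80000000)
    has_m = bool(type_code & 0x40000000)
    base = type_code & 0x0FFFFFFF  # strip EWKB flag bits
    if base >= 1000:  # ISO WKB: thousands block encodes Z/M/ZM
        block = base // 1000
        has_z = has_z or block in (1, 3)
        has_m = has_m or block in (2, 3)
    return "xy" + ("z" if has_z else "") + ("m" if has_m else "")
```

For example, a little-endian ISO `POINT Z` header is `b"\x01"` followed by the uint32 `1001`, while the EWKB equivalent uses `1 | 0x80000000`; both classify as `"xyz"`.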

Release geoarrow 0.1.0

(this is still a few months off, but is a hook to keep track of/discuss progress related to the initial release)

First release:

Prepare for release:

  • devtools::build_readme()
  • urlchecker::url_check()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • rhub::check(platform = 'ubuntu-rchk')
  • rhub::check_with_sanitizers()
  • Review pkgdown reference index for, e.g., missing topics
  • Draft blog post

Submit to CRAN:

  • usethis::use_version('minor')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • usethis::use_news_md()
  • Finish blog post
  • Tweet
  • Add link to blog post in pkgdown news menu

point-default.parquet is not readable with pyarrow / arrow C++

>>> import pyarrow.parquet as pq
>>> pq.read_table('inst/example_parquet/point-default.parquet')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/even/arrow/cpp/build/myvenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 1996, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
  File "/home/even/arrow/cpp/build/myvenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 1831, in read
    table = self._dataset.to_table(
  File "pyarrow/_dataset.pyx", line 323, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 2311, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Expected all lists to be of size=2 but index 3 had size=0

In an OGR Parquet driver I'm developing, I can reproduce the same issue with a NULL point. It seems that the Arrow C++ library doesn't correctly handle writing (or reading? I'm not sure which side is broken) a NULL entry for a FixedSizeList in the Parquet format (this works correctly for Feather). The workaround I found is to write a POINT EMPTY instead of a NULL entry.

Support RecordBatch with geoarrow

Using some arrow-rs, geoarrow-rust, and extendr magic, I am able to return a RecordBatch containing a geoarrow array to R as a nanoarrow_array_stream. However, using geoarrow-r I haven't been able to get this as a geoarrow array. I can use as.data.frame() to get it into a data.frame, but without a proper geometry column.

library(httr2)
devtools::load_all()
#> ℹ Loading serdesri
furl <- "https://services.arcgis.com/P3ePLMYs2RVChkJx/ArcGIS/rest/services/USA_Counties_Generalized_Boundaries/FeatureServer/0"
url <- paste0(furl, "/query?where=1=1&outFields=*&f=json&resultRecordCount=100")
req <- httr2::request(url)
resp <- httr2::req_perform(req)
json <- httr2::resp_body_string(resp)

# parse body as RecordBatch
res <- parse_esri_json_raw_geoarrow(resp$body, 2)
res
#> <nanoarrow_array_stream struct<OBJECTID: int64, NAME: string, STATE_NAME: string, STATE_FIPS: string, FIPS: string, SQMI: double, POPULATION: int32, POP_SQMI: double, STATE_ABBR: string, COUNTY_FIPS: string, Shape__Area: double, Shape__Length: double, geometry: geoarrow.polygon{list<rings: list<vertices: fixed_size_list(2)<xy: double>>>}>>
#>  $ get_schema:function ()  
#>  $ get_next  :function (schema = x$get_schema(), validate = TRUE)  
#>  $ release   :function ()  

x <- as.data.frame(res)
#> Warning in warn_unregistered_extension_type(x): geometry: Converting unknown
#> extension geoarrow.polygon{list<rings: list<vertices: fixed_size_list(2)<xy:
#> double>>>} as storage type
#> Warning in warn_unregistered_extension_type(storage): geometry: Converting
#> unknown extension geoarrow.polygon{list<rings: list<vertices:
#> fixed_size_list(2)<xy: double>>>} as storage type
head(x$geometry)
#> <list_of<list_of<list_of<double>>>[6]>
#> [[1]]
#> <list_of<list_of<double>>[1]>
#> [[1]]
#> <list_of<double>[39]>
#> ... truncated it for everyone's sake

`geoarrow()` drops CRS sometimes?

# convert geometry to geoarrow encoding
geom <- as_geoarrow(
  table_sorted$geometry,
  schema_override = geoarrow_schema_wkb()
)
# TODO: this shouldn't drop CRS but it does
geom <- geoarrow(geom)

Error when loading the example - missing Z values

Hi there,

When I try to load a geoparquet (including the example) I get the following error. I think this is related to the Z dimension, since when I tried it with a geoparquet with a z dimension it loaded fine.

library(geoarrow)

nc <- sf::read_sf(system.file("shape/nc.shp", package = "sf"))
write_geoparquet(nc, "nc.parquet")
read_geoparquet_sf("nc.parquet")

error
Error in geoarrow_schema_wkb(name = schema$name, format = schema$format, : startsWith(format, "w:") || isTRUE(format %in% c("z", "Z")) is not TRUE

geoarrow schema not interpretable by `arrow::as_schema()`

The schema created by infer_geoarrow_schema() cannot be parsed by arrow::as_schema(). I am also having problems parsing the schema using Rust FFI bindings; I wonder if these could be related.

x <- sf::st_read(system.file("shape/nc.shp", package = "sf")) |> 
  sf::st_geometry() |> 
  geoarrow::as_geoarrow_array()
#> Reading layer `nc' from data source 
#>   `/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/sf/shape/nc.shp' 
#>   using driver `ESRI Shapefile'
#> Simple feature collection with 100 features and 14 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> Geodetic CRS:  NAD27

geoarrow::infer_geoarrow_schema(x) |> 
  arrow::as_schema()
#> Error: Invalid: Cannot import schema: ArrowSchema describes non-struct type geoarrow.multipolygon <CRS: {
#>   "$schema": "https://pro...

Created on 2024-01-28 with reprex v2.0.2

Handling of geoparquet when not loading `geoarrow`

First of all, thanks for this awesome work. It's been great to see the progress on all this :-)

In the example in the readme, you load a .parquet file that contains a geometry column. Since there is no separate naming convention (e.g. .geo.parquet or .geoparquet), I might not know that there is a geometry in there, so I just load arrow and open the dataset as normal. Looking at the geometry column would be confusing to me. This behavior differs depending on whether I have the geoarrow package loaded or not.

library(tidyverse)
library(arrow)

open_dataset("~/Desktop/nc.parquet") |>
  head(n = 1) |>
  pull(geometry, as_vector = TRUE)
#> <arrow_binary[1]>
#> [1] 01, 06, 00, 00, 00, 01, 00, 00, 00, 01, 03, 00, 00, 00, 01, 00, 00, 00, 1b, 00, 00, 00, 00, 00, 00, a0, 41, 5e, 54, c0, 00, 00, ...

library(geoarrow)
open_dataset("~/Desktop/nc.parquet") |>
  head(n = 1) |>
  pull(geometry, as_vector = TRUE)
#> <geoarrow_wkb[1]>
#> [1] MULTIPOLYGON (((-81.47276 36.23436, -81.54084 36.27251, -81.56198 36.27359, -81.63306 36.34069, -81.74107 36.39178, -81.69828 36.47178...

This issue might belong in the R arrow package, but I'm wondering if arrow should detect when a geometry column is present and adjust its behavior (the metadata is in there, so this information is known). For example, when calling collect(), should there be a warning that a geometry column is being collected and that geoarrow::st_collect() might be the better option (as in #21)? Or a warning when opening a geoparquet without geoarrow loaded?

library(tidyverse)
library(arrow)

nc = open_dataset("~/Desktop/nc.parquet") 
# We know there is a geometry from the metadata
nc$metadata[[1]]
#> [1] "{\"version\":\"0.3.0\",\"primary_column\":\"geometry\",\"columns\":{\"geometry\":{\"encoding\":\"WKB\",\"crs\":\"GEOGCS[\\\"NAD27\\\",DATUM[\\\"North_American_Datum_1927\\\",SPHEROID[\\\"Clarke 1866\\\",6378206.4,294.978698213898]],PRIMEM[\\\"Greenwich\\\",0],UNIT[\\\"degree\\\",0.0174532925199433,AUTHORITY[\\\"EPSG\\\",\\\"9122\\\"]],AXIS[\\\"Latitude\\\",NORTH],AXIS[\\\"Longitude\\\",EAST],AUTHORITY[\\\"EPSG\\\",\\\"4267\\\"]]\",\"bbox\":[-84.3239,33.882,-75.457,36.5896],\"geometry_type\":\"MultiPolygon\"}}}"

Handle multiple dimensions among features/respect strict = TRUE

A few options:

  • Error (what happens now)
  • Fill extra dimensions with NaN if strict is TRUE and there are extra dimensions
  • Drop dimensions if strict is TRUE and the dimension isn't supposed to be there

Perhaps all of those (make users opt in to extra dimensions filled with NaN)? Either way, strict = TRUE might not be respected, or might give a different error because the schemas aren't compatible (clearly this isn't tested).

Notebook Viewer in RStudio errors when viewing a geoarrow_vctr

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(arrow, warn.conflicts = FALSE)
library(geoarrow)

bucket <- s3_bucket("voltrondata-public-datasets")
ds <- open_dataset(bucket$path("phl-parking"))
ds %>% 
  head() %>% 
  collect()
#> # A tibble: 6 × 13
#>   anon_ticket_number issue_datetime      state anon_plate_id division location  
#>                <int> <dttm>              <chr>         <int>    <int> <chr>     
#> 1              39985 2011-12-31 21:17:00 PA          1606959       NA 832 N 40T…
#> 2              41812 2011-12-31 21:54:00 PA           503820       NA 7200 N 19…
#> 3              41814 2011-12-31 21:45:00 PA          1102245       NA 7900 PROV…
#> 4              46288 2011-12-31 20:09:00 NJ           427139       NA 450 N 6TH…
#> 5              46289 2011-12-31 20:10:00 NJ           308463       NA 448 N 6TH…
#> 6              46290 2011-12-31 20:12:00 PA          1585402       NA 446 N 6TH…
#> # … with 7 more variables: violation_desc <chr>, fine <dbl>,
#> #   issuing_agency <chr>, gps <lgl>, zip_code <int>, geometry <grrw_pnt>,
#> #   year <int>

Except that in an RStudio Notebook I get:

Error in wk_handle.geoarrow_vctr(handleable, wkt_format_handler(precision = precision,  : 
  `` is an external pointer to NULL
