
geodist's Introduction

hypertidy

[ ... ]

The hypertidy work plan

Build out a core set of packages for developing projects.

core components

  • point-in-polygon (see pfft, spacebucket, polyclip)
  • path-reconstruction-from-edges (dodgr magic)
  • distance-to and distance-from (best currently is spatstat and sf)
  • discretization and other forms of coordinate transformation
  • core input interfaces that are lazy, see lazyraster/vapour and tidync as workings toward a common framework

vapour

  • check vapour on MacOS
  • check license / copyright issues for vapour
  • build sfcore
  • put core raster logic in discrete

NetCDF

  • fix group_by for tidync
  • find a way to cut dimensions (dplyr::tile?)

anglr

  • integrate TR's quadmesh helpers

silicate

  • dynamic edges, record original start/end edge with throw-away Steiner points?

  • inner cascade and semi cascade, names for the concept, activate to start in the middle, formal join-ramp

  • finish edge and arc-node workers

  • restructure sc around the gibble concept

  • consolidate silicate (remove sc/scsf)

  • release unjoin and gibble for use by the sc family

overall approach

The hypertidy approach to complex data structures aims to bring the goals of the tidyverse to spatial data from a single point of perspective: the only thing that makes geo-spatial data special is the system of coordinate transformations, which provides a family of compromises for generating and working with a particular set of spatial properties. Every other aspect that gets special attention is shared with other domains such as graphics, model structures, grid domains, and aspects of user interactivity and ease of use. Further, the tidy principles dictate that the majority of data manipulation and analysis is best handled using database principles and technologies, and "geo-spatial" is no different. No special handling is required, and we believe strongly that current idioms and established practices built on special handling hinder innovation, education and understanding generally. Other fields have essentially solved the main problems in data analysis, handling and user interaction, but domain traditions prevent these solutions being used in optimal ways.

This applies to drawings, GIS vector points, lines, areas, simple features, segment-based linear paths, triangulations and other forms of mesh, and we consider them all to be either a piecewise linear complex or a simplicial complex, with (this bit is crucial) further levels of organization within and between primitive components.

Legacy optimizations in geo-spatial fields have produced a strong focus on the path, which is an ordered sequence of coordinates; the connectivity is implicit, defined by "joining the dots" between consecutive coordinates. The dual of the path is an unordered set of edges (a.k.a. line segments), where each pair of coordinates traversed by an edge is referenced implicitly by name.
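A tiny base-R sketch of that dual; the vertex_, .vx0 and .vx1 column names are only illustrative, not any package's API:

# A single path as an ordered coordinate sequence ("join the dots") ...
path <- cbind(x = c(0, 0, 1, 1), y = c(0, 1, 1, 0))
vertex <- data.frame(vertex_ = seq_len(nrow(path)),
                     x = path[, "x"], y = path[, "y"])

# ... and its dual: an unordered set of edges, each referencing two vertices
# implicitly by identifier rather than by position in a sequence.
edge <- data.frame(.vx0 = vertex$vertex_[-nrow(vertex)],
                   .vx1 = vertex$vertex_[-1])
edge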

This provides a set of forms for complex structures that collectively allows generic transformation workflows.

    1. Bespoke formats. This is what we have; there are many.
    2. Structural vertex-instance set and path-geometry map.
    3. Normal-form path topology.
    4. Simplicial complex forms.

The last two here include vertex topology, in that each vertex is a unique coordinate and may be referenced multiple times. Form 1 is a special case for transition between path-based forms.

There are several required forms. It's not clear to me that the list below is a sequence; for example, arc-node is not necessarily a good pathway from 3 to 5, since 4 is a specialization of planar linear forms for polygons or networks, not a required intermediate.

1 and 2 suffer from requiring an implicit or structural order for the sequence of coordinates within a path.

  1. Paths in multiple tables with a form of structure index, a run-length map. This is what sc_coord, sc_path, and sc_object provide.
  2. Relational paths, no structural index and no de-duplication, all that is needed is vertex, path, object.
  3. Relational paths normalized by vertex; requires a path_link_vertex table to link the paths and the de-duplicated vertices (a sketch follows this list).
  4. Relational paths normalized by vertex and by path. It's probably never worth normalizing a polygonal path, but it is worth it for arc-node models, such as TopoJSON, OSM ways, and the data at the core of the maps package.
  5. Relational directed linear segments; this first form treats segments the way that 2 treats coordinates - no de-duplication, so the path is recorded with the segment ID.
  6. Relational undirected linear segments; this second form de-duplicates segments, ignoring their orientation, and requires a new link table between segment and path (which is where the orientation could be stored, if it's needed; this is what TopoJSON does, storing a 1 or -1 for orientation).
  7. Relational triangles, composed of segments.
  8. Relational triangles, forget the segments.
  9. Higher forms?
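A minimal base-R sketch of form 3, normalizing a coordinate table by vertex; the table and column names (vertex_, path_, path_link_vertex) are illustrative rather than the sc/silicate API:

# Two square paths that share an edge, as raw coordinate instances.
coord <- data.frame(x = c(0, 0, 1, 1, 0,  1, 1, 2, 2, 1),
                    y = c(0, 1, 1, 0, 0,  0, 1, 1, 0, 0),
                    path_ = rep(c("p1", "p2"), each = 5))

# De-duplicate coordinates into a vertex table ...
vkey <- paste(coord$x, coord$y, sep = "|")
vertex <- coord[!duplicated(vkey), c("x", "y")]
vertex$vertex_ <- seq_len(nrow(vertex))

# ... and keep a link table so each path references shared vertices by id.
path_link_vertex <- data.frame(path_ = coord$path_,
                               vertex_ = match(vkey, vkey[!duplicated(vkey)]))
path_link_vertex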

path

  • same for polygons and lines and points
  • involves normalization of vertices, but maybe it should not
  • object, path, coordinate
  • combining paths is trivial, and is possibly the lowest common denominator for merging

path_topological

  • could involve vertex normalization
  • arc normalization?
  • how to record/infer closed paths; this probably should be explicit?
  • object, path, coordinate, vertex
  • combining these is tough: first expand all coords? normalize vertices of the separate inputs, then merge? (see the sketch after these notes)

: arc normalization is problematic I think; does it imply segment normalization first?

: is part of the key here to keep links to the inputs as they were?
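One possible answer to the combining question above, sketched in base R with the same illustrative structure as the previous sketch (an input is a list holding vertex and path_link_vertex tables): expand each input back to coordinate instances, bind, then re-normalize.

# Expand a vertex-normalized input back to in-order coordinate instances.
expand_coords <- function(x) {
  idx <- match(x$path_link_vertex$vertex_, x$vertex$vertex_)
  data.frame(x$vertex[idx, c("x", "y")], path_ = x$path_link_vertex$path_)
}

# Combine two inputs by expanding both, binding, and re-normalizing vertices
# (assumes path_ ids are already unique across the two inputs).
combine_normalized <- function(a, b) {
  coord <- rbind(expand_coords(a), expand_coords(b))
  vkey <- paste(coord$x, coord$y, sep = "|")
  vertex <- coord[!duplicated(vkey), c("x", "y")]
  vertex$vertex_ <- seq_len(nrow(vertex))
  list(vertex = vertex,
       path_link_vertex = data.frame(path_ = coord$path_,
                                     vertex_ = match(vkey, vkey[!duplicated(vkey)])))
}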

segment / 1D primitive

  • true edge graph
  • definitely requires node inference and inclusion

triangles / 2D primitive

  • segment is truly a prerequisite
  • CGAL seems prone to duplicates and cross over segs

Inputs?

  • lines and lines, needs noding result is

Lessons from silicore

https://github.com/hypertidy/silicore#the-longer-silicore-story

Lessons from the space bucket

Point in polygon is core.

Determining if a point falls inside a ring is classifying that point with that path. When paths can be nested there needs to be logic for holes, and for multiple classifications - however it's achieved. Obviously the search space can be optimized for multi-points, multi-paths.

In hypertidy/pfft we isolate the conversion of an edge-form to a triangulation and its complementary point-in-path so we can filter out holes and classify triangle instances.
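For reference, a plain-R crossing-number test for one point against one ring, with no hole or multi-path handling; this only sketches the classification step, it is not the pfft/polyclip code:

# Classic crossing-number ("ray casting") test for a single point and ring.
point_in_ring <- function(px, py, rx, ry) {
  n <- length(rx)
  j <- n
  inside <- FALSE
  for (i in seq_len(n)) {
    # toggle when a horizontal ray from the point crosses edge (j, i)
    if (((ry[i] > py) != (ry[j] > py)) &&
        (px < (rx[j] - rx[i]) * (py - ry[i]) / (ry[j] - ry[i]) + rx[i])) {
      inside <- !inside
    }
    j <- i
  }
  inside
}
point_in_ring(0.5, 0.5, c(0, 1, 1, 0), c(0, 0, 1, 1))
#> [1] TRUE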

Paths or segments?

We absolutely need both; for intersections we need the triangle filtering/classification. The gibble is a run-length encoding into the in-order set of coordinate instances, and this is straightforward from native sf, and also straightforward from the dense vertex set as long as the order can be maintained - otherwise via path composition from arbitrary edges. I feel that the gibble is invalidated by vertex densification, and probably we don't care if path composition is trivial, and holes are inherent in triangulations anyway. At any rate, if we use a triangulation for intersection then it is a dead end, because it provides information about the sf inputs. Otherwise we leave the inputs behind and use only primitives, then restore as needed.

Extents is core

We need entities that act like a set of bbox/extents, storing only four numbers for each. An sf form from these is purely on-demand. A raster is the densest, a rectilinear grid is next, a corner-based mesh is next, then the set of quads as a special case of the more general "set of extents".

We use an extents entity to quad-tree optimize things like point-in-polygon classification.
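A minimal sketch of such an extents entity in base R, with an on-demand containment test that shortlists candidates before any exact point-in-polygon work; the column names are illustrative:

# One row of four numbers per feature.
extent <- data.frame(xmin = c(0, 2), xmax = c(1, 3),
                     ymin = c(0, 2), ymax = c(1, 3))

# Which extents could contain this point? Only those candidates need the
# full point-in-polygon test.
point_in_extent <- function(px, py, ex) {
  which(px >= ex$xmin & px <= ex$xmax & py >= ex$ymin & py <= ex$ymax)
}
point_in_extent(0.5, 0.25, extent)
#> [1] 1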

geodist's People

Contributors

daniellemccool, mpadge, olivroy


geodist's Issues

Cheap is being expensive

Hi,
Using the default method "cheap" when points are on either side of the -180/180 line results in it calculating the long way around. Talking to @mdsumner I got around that by using the method "geodesic", but I thought I would highlight it.

lower <- c(176.6008, -71.094)    # NZ_1265_45_1
upper <- c(-176.8917, -66.77333) # NZ_1262_03_3
geodist::geodist(lower, upper)/1000
#>          [,1]
#> [1,] 14128.98

geodist::geodist(lower, upper, measure = "geodesic")/1000
#>          [,1]
#> [1,] 547.4822

Created on 2020-07-20 by the reprex package (v0.3.0)

which metrics?

here you go @mdsumner - this should be very quick to assemble. We can just start with vectorized C++ implementations of:

  1. Haversine
  2. Vincenty
  3. Mapbox cheap, as explained in this blog post.

This can all be done quickly with only an Rcpp dependency, but we could even dump that and then be essentially dependency-free?
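For reference, a textbook haversine in plain R (not the eventual C implementation in the package); the Earth-radius value is just an approximate mean radius in metres:

haversine <- function(lon1, lat1, lon2, lat2, r = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a))   # great-circle distance in metres
}
haversine(0, 0, 1, 1)
#> roughly 157000 (metres)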

Fix cheapdist

@mdsumner This one's for you, also to help you get your teeth into the code. I've implemented Haversine, Vincenty, and the mapbox cheap ruler, but the latter does not give consistent results:

library (geodist)

n <- 50
dx <- dy <- 0.01
x <- cbind (-100 + dx * runif (n), 20 + dy * runif (n))
y <- cbind (-100 + dx * runif (2 * n), 20 + dy * runif (2 * n))
colnames (x) <- colnames (y) <- c ("x", "y")
d1 <- geodist (x, measure = "haversine")
d2 <- geodist (x, measure = "cheap")
plot (d1, d2, pch = 1)


I think part of the problem is that the notes in their blog description purport to calculate cos((lat1 + lat2) / 2) as if these are in degrees rather than radians. I've converted them to radians, but have not converted other values. These kinds of inconsistencies are the likely cause, and likely all that needs to be done is to look inside the cheap ruler source code and fix my coded interpretation of the blog accordingly.
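For orientation, a minimal flat-earth sketch of the idea in the blog post, with the cosine taken in radians; this captures only the crux of the degrees/radians issue, not the full cheap-ruler series used in the package:

# Scale longitude differences by the cosine of the mean latitude (radians),
# then take a plain Euclidean distance on the scaled differences.
cheap_approx <- function(lon1, lat1, lon2, lat2, r = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad * cos((lat1 + lat2) / 2 * to_rad)
  r * sqrt(dlat^2 + dlon^2)
}
cheap_approx(-100, 20, -100.01, 20.01)
#> about 1526 metres, close to haversine over this short distance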

geodist_benchmark function

geodist_benchmark (lon, lat, d)

returns a list of errors, both in metres and relative, for the available measures quantified against geodesic code.

units

extending from #1, implement a units parameter to set user-determined units. This is done very neatly in cheap ruler, conveniently giving us our desired list of units to include.

Updated version of package fails when y coordinate is called ymax

Discovered in the slopes package and documented here: ropensci/slopes#25

After some investigation it seems the names of the coordinates are the cause:

geodist::geodist(c(xmin = -9.2, ymin = 39), c(xmax = -9.1, ymin = 38))
#> Maximum distance is > 100km. The 'cheap' measure is inaccurate over such
#> large distances, you'd likely be better using a different 'measure'.
#>          [,1]
#> [1,] 111472.8
geodist::geodist(c(xmin = -9.2, ymin = 39), c(xmax = -9.1, ymax = 38))
#> Error in find_xy_cols(obj): Unable to determine longitude and latitude columns; perhaps try re-naming columns.

Created on 2021-02-02 by the reprex package (v1.0.0)

Fix package man entry

From Kurt Hornik, 19/08/2023:

Dear maintainer,

You have file 'geodist/man/geodist.Rd' with \docType{package}, likely
intended as a package overview help file, but without the appropriate
PKGNAME-package \alias as per "Documenting packages" in R-exts.

This seems to be the consequence of the breaking change

Using @doctype package no longer automatically adds a -package alias.
Instead document _PACKAGE to get all the defaults for package
documentation.

in roxygen2 7.0.0 (2019-11-12) having gone unnoticed, see
r-lib/roxygen2#1491.

As explained in the issue, to get the desired PKGNAME-package \alias
back, you should either change to the new approach and document the new
special sentinel

"_PACKAGE"

or manually add

@aliases geodist-package

if remaining with the old approach.

add sequential param?

sequential = TRUE (non-default) would calculate sequential distances along the rows of a single object, precisely as geosphere does it, so would return n-1 distances rather than the entire n-by-n matrix. @mdsumner whaddayareckon? Methinks likely quite useful.
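A sketch of what that would return, mimicked here by manually pairing consecutive rows (the sequential argument itself is the proposal; this assumes the existing paired mode):

x <- cbind(lon = c(0, 0.1, 0.2), lat = c(0, 0.1, 0.2))
n <- nrow(x)

# n - 1 distances between consecutive rows of a single n-row input.
d_seq <- geodist::geodist(x[-n, , drop = FALSE], x[-1, , drop = FALSE],
                          paired = TRUE, measure = "haversine")
length(d_seq)
#> [1] 2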

Request option not to warn when using cheap (or even test distances)

I use geodist to test whether or not two points are within a very small range of each other (80 - 200 meters) while identifying stops.

For these purposes, cheap is just fine, so I'd prefer to use it. Unfortunately, not all the points that are tested are close together. What this usually means is that, despite the fact that any distances that ARE above 100km are uninteresting to me anyway, I have to deal with a wall of warnings and, presumably, some efficiency loss because the maximum distance is being tested every time.

It's a minor annoyance, but maybe there are others that feel this way?
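Until an option exists, one workaround is to muffle the check around the call; a sketch, noting this drops every warning/message from that call, which may be acceptable here:

# Depending on whether the check is emitted as a warning or a message,
# one (or both) of these wrappers will silence it.
d <- suppressMessages(suppressWarnings(
  geodist::geodist(cbind(lon = 0, lat = 0), cbind(lon = 5, lat = 5),
                   measure = "cheap")
))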

benchmark not meaningful; sf doesn't use Vincenty

In the benchmark on your README, you compare runtime with sf but not the outcomes, which are entirely different: as the crs hasn't been set in the sf objects, it computes Euclidean distances. Setting it to 4326 will compute ellipsoidal distances in m (use %>% st_sfc(crs = 4326)). This will also increase your speed gain to a factor of 40.

The reason for the slowness of the library used by sf (PROJ) is that it computes ellipsoidal distances through the geographiclib routines by Karney (https://link.springer.com/content/pdf/10.1007/s00190-012-0578-z.pdf), which approximate numerically up to 8-byte double precision. The paper points out the differences between this method and Vincenty's - a recommended read.
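A sketch of the comparison with the crs set, so that sf returns ellipsoidal metres rather than Euclidean degrees:

library(sf)

p <- st_sfc(st_point(c(0, 0)), st_point(c(1, 1)), crs = 4326)
st_distance(p)                                   # ellipsoidal, in metres

geodist::geodist(data.frame(lon = c(0, 1), lat = c(0, 1)),
                 measure = "geodesic")           # should agree closely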

Ditch Rcpp

Coz then we can make the all-important claim of being "Dependency free"!!! The calls are actually dead-easy, so it'll provide a good exercise for self and you @mdsumner in the ease of just using .Call().

geodist gives different results for single row matrix with cheap ruler

A one row matrix of from and to points gives a different distance as a many row matrix with the same coordinates when using the cheap ruler.

# example data
from <- matrix(rep(c(-0.193011, 52.15549),2), ncol = 2, byrow = TRUE)
to <- matrix(rep(c(-0.197722, 52.15395),2), ncol = 2, byrow = TRUE)
colnames(from) <- c("lon","lat")
colnames(to) <- c("lon","lat")

# Just first row of matrix
geodist::geodist(from[1, , drop = FALSE], to[1,, drop = FALSE], paired = TRUE, measure = "cheap")
# Both rows (gives different answer)
geodist::geodist(from, to, paired = TRUE, measure = "cheap")

# This doesn't happen with other measures
geodist::geodist(from[1, , drop = FALSE], to[1,, drop = FALSE], paired = TRUE, measure = "geodesic")
geodist::geodist(from, to, paired = TRUE, measure = "geodesic")

Min/Max Feature

Would it be a viable option to search for min/max distances? You answered a tweet of mine and showed geodist; the only problem with it was that it returns the full distance matrix.

If you have quite a large number of points this soon becomes a memory issue.
Would it be possible to only store a single number in the loop (e.g. check via ifelse whether it is bigger/smaller each iteration)? One could possibly set this as an argument in geodist().
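One interim sketch of that idea: compute distances in row chunks so only a chunk-by-M block is ever held in memory, keeping just the running minimum per row. The function and argument names here are hypothetical, not part of geodist:

# Row-wise minimum distance from 'from' to 'to' without the full matrix.
min_dist <- function(from, to, chunk_size = 1000, measure = "cheap") {
  out <- numeric(nrow(from))
  chunks <- split(seq_len(nrow(from)),
                  ceiling(seq_len(nrow(from)) / chunk_size))
  for (i in chunks) {
    d <- geodist::geodist(from[i, , drop = FALSE], to, measure = measure)
    out[i] <- apply(d, 1, min)
  }
  out
}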

Update CRAN version with geodist_paired_vec

I'm using geodist in a package that I might eventually want to put on CRAN, but I'd really benefit from the geodist_paired_vec method. I'm torn between adding it as a github dependency or keeping the current CRAN version.

If there were plans to update it soon, it'd save me the hard decision. :)

geodist_max/min functions

It would be useful to develop functions to return max or min distances, rather than all distances. Particularly useful in use cases with large inputs, say two relatively large 2-D matrices of N from and M to points where N-by-M would be too big to fit in memory, and you're only really interested in extracting minimal or maximal distances from each of the N points to whichever M point. The function would then return both max/min distance, and index into/ID of the M points.

Could then easily do a simple quadtree to cut the search space down based on coordinate differences only. This could simply be done in a fixed way, but would be more flexibly implemented as a "proper" quad-tree of successive cuts. Presuming the direct version ended up just selecting the closest 12.5% of points, that would require 2M subtractions and M/8 distance calculations (see the sketch below). An iterated quadtree would require:

iteration   n subtractions   n dists
1           2M               -
2           M                -
3           M / 2            -
4           M / 4            M / 8
thus adding only another 7M/4 subtractions, but far greater flexibility.
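A sketch of the fixed, single-cut version of that idea: a crude degree-space pre-filter that can misrank candidates near the dateline or poles, with all names hypothetical rather than proposed API:

# Nearest 'to' point for a single query: rough-rank by raw coordinate
# differences (2M subtractions), keep the closest eighth, then compute
# real distances only for those candidates (~M/8 distance calculations).
nearest_prefilter <- function(p, to, keep = 1 / 8, measure = "geodesic") {
  rough <- abs(to[, "lon"] - p["lon"]) + abs(to[, "lat"] - p["lat"])
  cand <- order(rough)[seq_len(max(1L, ceiling(keep * nrow(to))))]
  from <- matrix(p, nrow = 1, dimnames = list(NULL, names(p)))
  d <- geodist::geodist(from, to[cand, , drop = FALSE], measure = measure)
  list(index = cand[which.min(d)], dist = min(d))
}

to <- cbind(lon = runif(1000, -10, 10), lat = runif(1000, 40, 60))
nearest_prefilter(c(lon = 0, lat = 50), to)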

Issue warning when cheapdist becomes inaccurate

Following on from #25, the default measure = "cheap" ain't always appropriate, and it would probably be good to do a final check of the maximum distance returned, and issue a warning suggesting an alternative measure if this exceeds 100km.

Thanks

Hi guys, just wanted to say a huge thanks for this package: it's saved me a huge amount of programming and computer time! This is a godsend for work-in-progress updates to cyclestreets, an interface to a routing service for cyclists, by cyclists, from @mvl22 & co:

  dist_list = lapply(coord_list, function(x) {
    geodist::geodist(
      data.frame(x = x[, 1], y = x[, 2]),
      sequential = TRUE
    )
  })

Register C Callable

Inspired by recent discussions and progress in other libraries, and by data.table exposing their C functions, would you consider exposing the C functions from geodist so they can be called by other C/C++ code?

I've made a very quick prototype here, where all that was needed was to register a function with R_RegisterCCallable():

void R_init_geodist(DllInfo *dll)
{
    R_RegisterCCallable("geodist", "R_cheap", (DL_FUNC) &R_cheap);

    R_registerRoutines(dll, NULL, CallEntries, NULL, NULL);
    R_useDynamicSymbols(dll, FALSE);
}

Which then makes it available to other C/C++ code, for example:

library(Rcpp)
cppFunction(
  code = '
    SEXP cheap( SEXP x ) {
      typedef SEXP R_CHEAP_FUN(SEXP); 
      R_CHEAP_FUN *R_cheap = (R_CHEAP_FUN *) R_GetCCallable("geodist","R_cheap");
      return R_cheap( x );
    }
  '
)

And a quick check, using the example from ?geodist

n <- 50
# Default "cheap" distance measure is only accurate for short distances:
x <- cbind (runif (n, -0.1, 0.1), runif (n, -0.1, 0.1))
y <- cbind (runif (2 * n, -0.1, 0.1), runif (2 * n, -0.1, 0.1))
colnames (x) <- colnames (y) <- c ("x", "y")
d0 <- geodist (x) # A 50-by-50 matrix

d1 <- cheap(x)

all.equal( as.numeric(d0), d1 )
# [1] TRUE

Use parallel?

Distance computations could easily be done in parallel, but ...

  1. This is an arguably somewhat heavy dependency, contra our aims, and
  2. It is likely that these kinds of calculations will be used by people who really care about efficiency, and who will accordingly likely embed them within their own parallel routines.

Thoughts @mdsumner?

Speed up R API by allowing direct submission of numeric vector objects

The current interface presumes data.frame or matrix objects at all times, but the benchmarks of @symbolixAU show how much can be gained if simple numeric vectors could be submitted directly -- roughly a 3-times speed gain -- nope, that was an error; actual gains are around 28%. Importantly, submitting numeric matrices is no faster than data.frame objects, and it is only direct submission of vectors that speeds things up. The question is how such speed-ups might be enabled in the R interface here?

Sequential geodist produces different results to paired

There seem to be different results produced when geodist is used in sequential mode compared with manually pairing the points:

# Make example lon/lat points
mat <- matrix(c(-0.193011, 52.15549,
                -0.197722, 52.15395,
                -0.199949, 52.15527,
                -0.199533, 52.15762,
                -0.193205, 52.15757,
                -0.174015, 52.17529)
                , ncol = 2, byrow = TRUE)
colnames(mat) <- c("lon","lat")

# Use sequential mode
dist1 <- geodist::geodist(mat, sequential = TRUE)

# Manually make to and from objects
from <- mat[1:5,]
to <- mat[2:6,]
dist2 <- geodist::geodist(from, to, paired = TRUE)
identical(dist1, dist2) # FALSE

input formats

I'm currently imagining a generic

geodist(x, y)

where y is optional, and both are rectangular objects containing coordinates. If x only, then return a square matrix of distances; if y, then return a nrow(x)-by-nrow(y) matrix. Pretty much like geosphere functions.

@mdsumner Can you think of any other likely input formats? Have you ever wished geosphere allowed alternative input formats? (I surely haven't, and am always impressed by the input handling there.)
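A sketch of the two call patterns as proposed, using named lon/lat columns for illustration:

x <- cbind(lon = runif(5, -0.1, 0.1), lat = runif(5, -0.1, 0.1))
y <- cbind(lon = runif(3, -0.1, 0.1), lat = runif(3, -0.1, 0.1))

dim(geodist::geodist(x))      # x only: 5 x 5 square matrix
dim(geodist::geodist(x, y))   # x and y: 5 x 3 matrix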

pkgcheck results - main

Checks for geodist (v0.0.7.022)

git hash: 96552497

  • ✔️ Package is already on CRAN.
  • ✔️ has a 'CITATION' file.
  • ✔️ has a 'codemeta.json' file.
  • ✔️ has a 'contributing' file.
  • ✔️ uses 'roxygen2'.
  • ✔️ 'DESCRIPTION' has a URL field.
  • ✔️ 'DESCRIPTION' has a BugReports field.
  • ✔️ Package has at least one HTML vignette
  • ✔️ All functions have examples.
  • ✔️ Package has continuous integration checks.
  • ✔️ Package coverage is 91.8%.
  • ✔️ R CMD check found no errors.
  • ✔️ R CMD check found no warnings.

Package License: MIT + file LICENSE


1. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

  • code in C (89% in 14 files) and R (11% in 5 files)
  • 2 authors
  • 1 vignette
  • no internal data file
  • 1 imported package
  • 3 exported functions (median 27 lines of code)
  • 42 non-exported functions in R (median 12 lines of code)
  • 94 functions in src (median 27 lines of code)

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages
The following terminology is used:

  • loc = "Lines of Code"
  • fn = "function"
  • exp/not_exp = exported / not exported

The final measure (fn_call_network_size) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

measure                    value   percentile   noteworthy
files_R                        5         34.7
files_src                     14         95.0
files_vignettes                1         68.4
files_tests                    5         81.7
loc_R                        319         34.2
loc_src                     2692         80.5
loc_vignettes                182         46.0
loc_tests                    416         71.3
num_vignettes                  1         64.8
n_fns_r                       45         53.4
n_fns_r_exported               3         12.9
n_fns_r_not_exported          42         62.9
n_fns_src                     94         78.5
n_fns_per_file_r               6         73.0
n_fns_per_file_src             7         63.1
num_params_per_fn              4         54.6
loc_per_fn_r                  13         39.7
loc_per_fn_r_exp              27         58.8
loc_per_fn_r_not_exp          12         39.1
loc_per_fn_src                27         80.7
rel_whitespace_R              30         51.8
rel_whitespace_src            16         78.3
rel_whitespace_vignettes      12         15.1
rel_whitespace_tests          14         59.8
doclines_per_fn_exp           38         47.0
doclines_per_fn_not_exp        0          0.0   TRUE
fn_call_network_size         201         88.7

1a. Network visualisation

An interactive visualisation of calls between objects in the package has been uploaded as a workflow artefact. To view it, click on results from the latest 'pkgcheck' action, scroll to the bottom, and click on the 'visual-network' artefact.


2. goodpractice and other checks

Details of goodpractice and other checks (click to open)

3a. Continuous Integration Badges

R-CMD-check
pkgcheck

GitHub Workflow Results

name                          conclusion   sha      date
pages build and deployment    success      9a223d   2022-01-30
pkgcheck                                   965524   2022-01-30
pkgdown                       success      965524   2022-01-30
R-CMD-check                   success      965524   2022-01-30
test-coverage                 success      965524   2022-01-30

3b. goodpractice results

R CMD check with rcmdcheck

rcmdcheck found no errors, warnings, or notes

Test coverage with covr

Package coverage: 91.77

Cyclocomplexity with cyclocomp

No functions have cyclocomplexity >= 15

Static code analyses with lintr

lintr found no issues with this package!


Package Versions

package    version
pkgstats   0.0.3.88
pkgcheck   0.0.2.227
