geocompr / geocompr.github.io

Place to host presentations, articles and other types of web content related to geocomputation with R, Python, etc.

Home Page: https://geocompr.github.io/
geocompr.github.io's People

Contributors

andrewmaclachlan, dependabot[bot], dpprdan, jannes-m, makosak, nowosad, robinlovelace


geocompr.github.io's Issues

New domain

geocomp.xyz

Questions and comments:

  1. What should be our main GitHub base (geocompr organization? geocompx? a new geocompxyz organization?)
  2. What should be on our main webpage? Are we still keeping the blogdown page (https://geocompr.github.io/) after updating it a bit?
  3. Will we be using subdomains (e.g., r.geocomp.xyz) or subdirectories (e.g., geocomp.xyz/r)?
  4. We need to create a redirection from geocompr.robinlovelace.net/ to the new webpage (this also applies to other versions of the book).
  5. SWAG?

discussion about buffer vs st_is_within_distance

Not really an issue, just the start of a discussion about buffers versus sf::st_is_within_distance().

Context: reproducing an analysis done in PostGIS with R & sf

PostGIS in Action by Leo Hsu and Regina Obe came out with its third edition this year. My SQL and PostGIS were rusty, so it was a good time to read it again and try to reproduce some of it in R. When I read a technical book, I like to try the exercises in another language I am more familiar with. It takes more time, but it is a more active way of learning that suits me better.

The "Hello world" of this book provides you with restaurants and roads in the US and asks you "to find the number of fast-food restaurants within one mile of a highway". Restaurants are represented as points and highways as linestrings. The same approach applies to many other topics (pollution sites and rivers, for example).

You can download the data here. The first chapter ("What is a spatial database") is also available for free on the publisher's website.

In this post, we will compare two ways of doing this task in R: using buffers and using sf::st_is_within_distance().

Loading library and data

You have two options for loading the data. The first is to use PostgreSQL/PostGIS to put it in a database (DB) and then access it from R with the DBI and RPostgreSQL packages. The second is simply to use R and sf.

We will start with the second, because it doesn't involve using SQL, but we will show a quick way of doing the first from R afterward if you are curious.

Loading data with R and sf

library(sf)
library(dplyr)

# Loading restaurants =========================================================
# CSV doesn't have header
restaurants.dat <- read.csv("data/restaurants.csv",
                        header = FALSE,
                        col.names = c("id",
                                      "lat",
                                      "lon")) 

# classic lat/long columns to sf; see https://geocompr.robinlovelace.net/spatial-class.html
restaurants.shp <- sf::st_as_sf(restaurants.dat,
                                coords = c("lon", "lat"), 
                                crs = 4326)

# Here we use the same CRS recommended in PostGIS in Action.
# EPSG:2163 is a projected CRS, which is important for calculating distances and buffers;
# see: https://geocompr.robinlovelace.net/reproj-geo-data.html#which-crs-to-use
restaurants2163.shp <- sf::st_transform(restaurants.shp, 2163)

# Loading highways ============================================================

roads.shp <- sf::st_read("data/roadtrl020.shp",
                            crs = 4269)

# same projection as our restaurants
roads2163.shp <- sf::st_transform(roads.shp, 2163)

# Creating franchises table ===================================================
# this is not really needed for our example but PostGIS uses it so why not add it!
# this also shows one good DB practice: Normalisation  

franchises.dat <- data.frame(
    id = c("BKG",  "CJR",  "HDE", "INO", "JIB", "KFC", "MCD", "PZH", "TCB", "WDY"),
    franchise = c("Burger King", "Carl's Jr", 
                  "Hardee", "In-N-Out",
                  "Jack in the Box",  "Kentucky Fried Chicken",
                  "McDonald", "Pizza Hut",
                  "Taco Bell", "Wendys"))

Now we have the three tables. You can explore them a bit: restaurants2163.shp has 50,002 features (points) and roads2163.shp has 47,014 features with 10 fields. We have roads, but we want highways. In SQL this was done while populating the table, using a WHERE clause:

INSERT INTO ch01.highways (gid, feature, name, state, geom)
SELECT gid, feature, name, state, ST_Transform(geom, 2163)
FROM ch01.highways_staging
WHERE feature LIKE 'Principal Highway%';

roads2163.shp has a field called FEATURE where the type of road is stored (to mess with your GIS brain: here feature is a column, while in GIS a feature usually means an observation!). This is a bit tricky because you have Principal Highway but also Principal Highway Business Route and a bunch of other options. In PostgreSQL, % is a wildcard used with the LIKE operator to represent zero, one or multiple characters. In R we use grepl() with the pattern ^Principal Highway to build a vector of all the matching values, then use the %in% operator to keep only the roads whose FEATURE is in that vector. In the end we get 16,433 highways.

highway_type <- unique(roads2163.shp$FEATURE)

principal_highway <- highway_type[grepl(pattern = "^Principal Highway", 
                                        highway_type)]

highways2163.shp <- roads2163.shp[roads2163.shp$FEATURE %in% principal_highway,] 

If we compare lines of code, doing it in R is shorter. This is thanks to sf, and also because with PostgreSQL/PostGIS we have to define the tables and the relations between them (i.e. a strict data model). Both options have their pros and cons.

Loading data from PostGIS

If you followed the book's instructions and loaded the data into a spatial database, you can also access it with R and sf. Feel free to skip this part if you prefer to get straight to buffers versus sf::st_is_within_distance().

I usually create a file called code.R where I store the DB connection information (user, password, etc.) and add this file to .gitignore. Then you can write a small function that sources it and creates a connection to the DB. Here is an example (it can be improved!):

connect_to_db <- function() {
  # 1 Load library and credentials ============================================
  source("code.R", local = TRUE) # local = TRUE loads into the function's environment

  library(RPostgreSQL)

  # 2 Establish connection ====================================================

  drv <- dbDriver("PostgreSQL")

  # db, adress, usr, pwd are all defined in code.R
  con <- dbConnect(drv,
                   dbname = db,
                   host = adress,
                   # 5432 is the default port for Postgres; I usually change it
                   # when the DB is not on localhost
                   port = 5432,
                   user = usr,
                   password = pwd)

  # The connection needs to be in the global environment to be used elsewhere.
  # If you have a better way of doing this, I will take it!
  assign("con", con, envir = .GlobalEnv)

  # 3 Print useful stuff ======================================================

  # list the tables in the DB, which also checks the connection
  print(DBI::dbListTables(con))

  # a reminder
  print("Reminder: use DBI::dbDisconnect(con)!")
}

After that, you can use your function to create a connection to the DB and query the data using sf.

connect_to_db()
# sf is awesome! This also illustrates how easily R interacts with other languages/systems. 
restaurants2163.shp <- sf::st_read(con, query = "select * from ch01.restaurants;")
highways2163.shp  <- sf::st_read(con, query = "select * from ch01.highways;")
franchises.dat <- DBI::dbGetQuery(con, statement = "select * from ch01.lu_franchises;")

# Be friendly with your sys admin and disconnect!
dbDisconnect(con)

Finding the number of fast-food restaurants within one mile of a highway!

The PostGIS way

Let's first see the PostGIS way of doing it:

SELECT f.franchise
 , COUNT(DISTINCT r.id) As total -- <1>
FROM ch01.restaurants As r 
  INNER JOIN ch01.lu_franchises As f ON r.franchise = f.id
    INNER JOIN ch01.highways As h 
      ON ST_DWithin(r.geom, h.geom, 1609) -- <2>
GROUP BY f.franchise
ORDER BY total DESC;

The real "magic" is in <2>, the INNER JOIN: the ST_DWithin() function returns TRUE or FALSE depending on whether two geometries are within the given distance of each other. Yes, 1609 m is one mile, for those like me who live in the metric-system utopia. There is also a nice trick in <1>: a DISTINCT inside COUNT() removes duplicate ids, because one restaurant could be counted twice (or more) if it meets the condition for more than one highway. The rest is just classic SQL for organizing data, plus some aliasing required by PostgreSQL.

This is the result:

| franchise              | total |
|------------------------|-------|
| McDonald               |  5343 |
| Burger King            |  3049 |
| Pizza Hut              |  2920 |
| Wendys                 |  2446 |
| Taco Bell              |  2428 |
| Kentucky Fried Chicken |  2371 |
| Hardee                 |  1077 |
| Jack in the Box        |   509 |
| Carl's Jr              |   224 |
| In-N-Out               |    44 |

Thanks to \timing we know that this took 539.606 ms on my three-year-old laptop with 8 GB of RAM. That is really quick, but keep in mind that most of the work, like adding a spatial index, was done beforehand.

Doing it with R

At first, we tried to do it with the sf function st_is_within_distance():

dist_to_check <- 1609

# I have run this code, but you shouldn't unless you have plenty of time!
# timer <- Sys.time()
# the tricky part is the use of lengths() <- note the s, not length()
# see: https://geocompr.robinlovelace.net/spatial-operations.html#topological-relations 
# near_or_not <- lengths(sf::st_is_within_distance(restaurants2163.shp,
#                                                  highways2163.shp,
#                                                  dist_to_check)) > 0
# 
# resto_near_roadv2 <- restaurants2163.shp[near_or_not,]
# Timing_result <- difftime(Sys.time(), timer)

It took a bit less than 13 minutes. That is a lot of time, and it also crashed RStudio, probably because I didn't have enough memory.

When you only care about whether points are within a specific distance, you can also use a buffer: we can buffer the points and see which ones intersect our lines. We could also have buffered the roads and counted the points inside.
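That second variant, buffering the lines instead of the points, can be sketched on a toy example (made-up geometries, not the post's data), assuming sf is installed:

```r
library(sf)

# Toy data: one "highway" and two "restaurants" (made-up coordinates)
highway <- sf::st_sfc(sf::st_linestring(rbind(c(0, 0), c(10, 0))))
restaurants <- sf::st_sfc(sf::st_point(c(5, 1)), sf::st_point(c(5, 50)))

# Buffer the line, then keep only the points falling inside the buffer
buffered <- sf::st_buffer(highway, dist = 2)
inside <- lengths(sf::st_intersects(restaurants, buffered)) > 0
inside # TRUE FALSE: only the first point is within 2 units of the line
```

With the real data this would amount to buffering highways2163.shp and subsetting restaurants2163.shp with it, instead of the other way around.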

timer <- Sys.time()
# step 1 buffer
buffered_restaurants <- sf::st_buffer(restaurants2163.shp, dist = dist_to_check) 
# step 2 buffered restaurant intersecting a highway
resto_near_road <- buffered_restaurants[highways2163.shp,]
Timing_result <- difftime(Sys.time(),  timer)

This took just a bit more than 6 seconds. What a difference!

We then just need to use some dplyr and we are done:

# this is just tidyverse go brrr!
resto_near_road %>% 
    sf::st_drop_geometry() %>% 
    dplyr::left_join(franchises.dat, 
                     by = c("franchise" = "id")) %>% 
    dplyr::group_by(franchise) %>% 
    dplyr::summarise(total = dplyr::n()) %>% 
    dplyr::arrange(dplyr::desc(total))

If we compare this result with the ones from PostGIS and from st_is_within_distance(), we find one more Wendy's restaurant with st_is_within_distance() and PostGIS than with the buffer approach.
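One way to pin down the extra restaurant is to compare the two sets of ids with setdiff(). A sketch with made-up ids:

```r
# Made-up ids standing in for the two results (not the real data)
ids_within_distance <- c(101L, 102L, 103L, 104L)
ids_buffer          <- c(101L, 102L, 103L)

# Restaurants found by st_is_within_distance() but not by the buffer approach
extra <- setdiff(ids_within_distance, ids_buffer)
extra # 104
```

With the real objects that would be setdiff(resto_near_roadv2$id, resto_near_road$id), assuming both results keep the id column from the CSV.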

A more accurate benchmark?

To produce a more accurate benchmark, we need to take a sample of this data set. Prototyping scripts on a smaller subset is also good practice. We will use the spData package to get the US states and only look at the fast-food restaurants and highways in Utah.

library(spData)

Utah <- us_states[us_states$NAME == "Utah",]
Utah <- sf::st_transform(Utah, 2163)

resto_utah.shp <- restaurants2163.shp[Utah,]
road_utah.shp <- highways2163.shp[Utah,]

Now we can use the microbenchmark package to get better indicators than our Sys.time() trick.

library(microbenchmark)

microbenchmark(
    buffer = {
        buffered_restaurants <- sf::st_buffer(resto_utah.shp, dist = dist_to_check) 
        resto_near_road <- buffered_restaurants[road_utah.shp,]
    },
    within_distance = {
        near_or_not <- lengths(sf::st_is_within_distance(resto_utah.shp,
                                                         road_utah.shp,
                                                         dist_to_check)) > 0
        resto_near_roadv2 <- resto_utah.shp[near_or_not,]
    },
    times = 100
)
            expr      min       lq     mean   median       uq       max neval
          buffer 118.7830 123.9558 132.3502 125.8452 132.0432  235.9454   100
 within_distance 618.5249 633.8456 672.2927 640.7561 668.1274 1149.8452   100

The buffer version is 5 times quicker! So if you just want to know whether some points are close enough to other objects, it is better to do it with a buffer. If you need, for every point, the indices of all the objects matching the predicate (much more information), you should use the st_is_within_distance() function.
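The difference in what the two approaches return can be seen on a small made-up example (assuming sf is installed): st_is_within_distance() gives, for each point, the indices of every matching object, while the lengths(...) > 0 trick collapses that into a yes/no answer.

```r
library(sf)

# Two points and two lines (made-up coordinates)
pts <- sf::st_sfc(sf::st_point(c(0, 0)), sf::st_point(c(100, 0)))
lns <- sf::st_sfc(sf::st_linestring(rbind(c(0, 1), c(10, 1))),
                  sf::st_linestring(rbind(c(0, 2), c(10, 2))))

# Sparse index list: for each point, which lines are within distance 5
hits <- sf::st_is_within_distance(pts, lns, dist = 5)
hits               # point 1 matches lines 1 and 2; point 2 matches none
lengths(hits) > 0  # TRUE FALSE: the cheaper yes/no version
```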

covid - review

@Robinlovelace some of my comments after looking at the blog post:

  • It would probably be good not to download the world data directly in the blog post. Maybe it is better to read it from a file (or similar) with some hidden code?
  • There are many code chunks (and their outputs) at the beginning of the blog post. Maybe some are not necessary?
  • Not all animations are visible in the blog post
  • I think the blog post needs some more text explaining the ideas and the code

Feel free to change whatever you think is necessary and publish the blog post.
