geocompr / geocompr.github.io

Place to host presentations, articles and other types of web content related to geocomputation with R, Python, etc.

Home Page: https://geocompr.github.io/
geocompr.github.io's People

Contributors

andrewmaclachlan, dependabot[bot], dpprdan, jannes-m, makosak, nowosad, robinlovelace


geocompr.github.io's Issues

New domain

geocomp.xyz

Questions and comments:

  1. What should be our main GitHub base (geocompr organization? geocompx? a new geocompxyz organization?)
  2. What should be on our main webpage? Are we still keeping the blogdown page (https://geocompr.github.io/) after updating it a bit?
  3. Will we be using subdomains (e.g., r.geocomp.xyz) or subdirectories (e.g., geocomp.xyz/r)?
  4. We need to create a redirection from geocompr.robinlovelace.net/ to the new webpage (this also applies to other versions of the book).
  5. SWAG?

discussion about buffer vs st_is_within_distance

Not really an issue, just the start of a discussion about buffers versus sf::st_is_within_distance().

Context: reproducing an analysis done in PostGIS with R & sf

PostGIS in Action by Leo Hsu and Regina Obe came out with its third edition this year. My SQL and PostGIS were rusty, so it was a good time to read it again and try to reproduce some of it in R. When I read a technical book, I like to try the exercises in another language I am more familiar with. It takes more time, but it is a more active way of learning that suits me better.

The "Hello world" of this book provides you with restaurants and roads in the US and asks you "to find the number of fast-food restaurants within one mile of a highway". Restaurants are represented as points and highways as linestrings. The same approach applies to many other topics (pollution sites and rivers, for example).

You can download the data here. The first chapter ("What is a spatial database") is also available for free on the publisher's website.

In this post, we will compare two ways of doing this task in R: using buffers and using sf::st_is_within_distance().

Loading library and data

You have two options for loading the data. The first is to use PostgreSQL/PostGIS to put it in a database (DB) and then access it from R with the DBI and RPostgreSQL packages. The second is simply to use R and sf.

We will start with the second, because it doesn't involve using SQL, but we will show a quick way of doing the first from R afterward if you are curious.

Loading data with R and sf

library(sf)
library(dplyr)

# Loading restaurants =========================================================
# CSV doesn't have header
restaurants.dat <- read.csv("data/restaurants.csv",
                        header = FALSE,
                        col.names = c("id",
                                      "lat",
                                      "lon")) 

# classic lat/long columns to sf; see https://geocompr.robinlovelace.net/spatial-class.html
restaurants.shp <- sf::st_as_sf(restaurants.dat,
                                coords = c("lon", "lat"), 
                                crs = 4326)

# Here we use the same CRS recommended in PostGIS in Action.
# EPSG:2163 is a projected CRS, which is important for calculating distances and buffers;
# see: https://geocompr.robinlovelace.net/reproj-geo-data.html#which-crs-to-use
restaurants2163.shp <- sf::st_transform(restaurants.shp, 2163)

# Loading highways ============================================================

roads.shp <- sf::st_read("data/roadtrl020.shp",
                            crs = 4269)

# same projection as our restaurants
roads2163.shp <- sf::st_transform(roads.shp, 2163)

# Creating franchises table ===================================================
# this is not really needed for our example but PostGIS uses it so why not add it!
# this also shows one good DB practice: Normalisation  

franchises.dat <- data.frame(
    id = c("BKG",  "CJR",  "HDE", "INO", "JIB", "KFC", "MCD", "PZH", "TCB", "WDY"),
    franchise = c("Burger King", "Carl's Jr", 
                  "Hardee", "In-N-Out",
                  "Jack in the Box",  "Kentucky Fried Chicken",
                  "McDonald", "Pizza Hut",
                  "Taco Bell", "Wendys"))

Now we have the three tables. You can explore them a bit: restaurants2163.shp has 50,002 features (points) and roads2163.shp has 47,014 features with 10 fields. We have roads, but we want highways. In SQL this was done while populating the table, using a WHERE clause:

INSERT INTO ch01.highways (gid, feature, name, state, geom)
SELECT gid, feature, name, state, ST_Transform(geom, 2163)
FROM ch01.highways_staging
WHERE feature LIKE 'Principal Highway%';

roads2163.shp has a field called FEATURE where the type of road is stored (to mess with your GIS brain: here feature is a column, while in GIS a feature usually means an observation!). This is a bit tricky because you have Principal Highway but also Principal Highway Business Route and a bunch of other options. In PostgreSQL, % is a wildcard used with the LIKE operator to represent zero, one or multiple characters. In R we use grepl() with the pattern ^Principal Highway to build a vector of all the matching values, then use the %in% operator to keep only the roads whose FEATURE is in that vector. In the end we get 16,433 highways.

highway_type <- unique(roads2163.shp$FEATURE)

principal_highway <- highway_type[grepl(pattern = "^Principal Highway", 
                                        highway_type)]

highways2163.shp <- roads2163.shp[roads2163.shp$FEATURE %in% principal_highway,] 

If we compare lines of code, doing it in R is shorter. This is thanks to sf, and also because with PostgreSQL/PostGIS we have to define the tables and the relations between them (i.e. a strict data model). Both options have their pros and cons.

Loading data from PostGIS

If you followed the book's instructions and loaded the data into a spatial database, you can also access it with R and sf. Feel free to skip this part if you prefer to get straight to buffers versus sf::st_is_within_distance().

I usually create a file called code.R where I store the DB connection information (user, password, etc.) and add this file to .gitignore. Then you can write a small function that sources it and creates a connection to the DB. Here is an example (it can be improved!):

connect_to_db <- function() {
  # 1 Load library and credentials ============================================
  source("code.R", local = TRUE) # local = TRUE loads into the function's environment

  library(RPostgreSQL)

  # 2 Establish connection ====================================================

  drv <- dbDriver("PostgreSQL")

  # db, adress, usr, pwd are all defined in code.R
  con <- dbConnect(drv,
                   dbname = db,
                   host = adress,
                   # 5432 is the default port for Postgres; I usually change it
                   # when the DB is not on localhost
                   port = 5432,
                   user = usr,
                   password = pwd)

  # The connection needs to be in the global environment to be used elsewhere.
  # If you have a better way of doing this, I will take it!
  assign("con", con, envir = .GlobalEnv)

  # 3 Print useful stuff ======================================================

  # list the tables in the DB, which also checks the connection
  print(DBI::dbListTables(con))

  # a reminder
  print("Reminder: use DBI::dbDisconnect(con)!")
}

After that, you can use your function to create a connection to the DB and query the data using sf.

connect_to_db()
# sf is awesome! This also illustrates how easily R interacts with other languages/systems. 
restaurants2163.shp <- sf::st_read(con, query = "select * from ch01.restaurants;")
highways2163.shp  <- sf::st_read(con, query = "select * from ch01.highways;")
franchises.dat <- DBI::dbGetQuery(con, statement = "select * from ch01.lu_franchises;")

# Be friendly with your sys admin and disconnect!
dbDisconnect(con)

Finding the number of fast-food restaurants within one mile of a highway!

The PostGIS way

Let's first see the PostGIS way of doing it:

SELECT f.franchise
 , COUNT(DISTINCT r.id) As total -- <1>
FROM ch01.restaurants As r 
  INNER JOIN ch01.lu_franchises As f ON r.franchise = f.id
    INNER JOIN ch01.highways As h 
      ON ST_DWithin(r.geom, h.geom, 1609) -- <2>
GROUP BY f.franchise
ORDER BY total DESC;

The real "magic" is in <2>, the INNER JOIN: the ST_DWithin() function returns TRUE or FALSE depending on whether two geometries are within the given distance of each other. Yes, 1609 m is one mile, for those like me who live in the metric-system utopia. There is also a nice trick in <1>: a DISTINCT inside COUNT() removes duplicate ids, because one restaurant could be counted twice (or more) if it meets the condition for more than one highway. The rest is just classic SQL for organizing data, plus some aliasing required by PostgreSQL.

This is the result:

| franchise              | total |
|------------------------|-------|
| McDonald               |  5343 |
| Burger King            |  3049 |
| Pizza Hut              |  2920 |
| Wendys                 |  2446 |
| Taco Bell              |  2428 |
| Kentucky Fried Chicken |  2371 |
| Hardee                 |  1077 |
| Jack in the Box        |   509 |
| Carl's Jr              |   224 |
| In-N-Out               |    44 |

Thanks to \timing we know that this took 539.606 ms on my three-year-old laptop with 8 GB of RAM. That is really quick, but keep in mind that most of the work, like adding a spatial index, was done beforehand.

Doing it with R

At first, we tried to do it with the sf function st_is_within_distance():

dist_to_check <- 1609

# I have run this code, but you shouldn't unless you have plenty of time!
# timer <- Sys.time()
# the tricky part is the use of lengths() <- note the s, not length()
# see: https://geocompr.robinlovelace.net/spatial-operations.html#topological-relations 
# near_or_not <- lengths(sf::st_is_within_distance(restaurants2163.shp,
#                                                  highways2163.shp,
#                                                  dist_to_check)) > 0
# 
# resto_near_roadv2 <- restaurants2163.shp[near_or_not,]
# Timing_result <- difftime(Sys.time(), timer)

It took a bit less than 13 minutes. That is a lot of time, and it also crashed RStudio, probably because I didn't have enough memory.

When you only care about whether points are within a specific distance, you can also use a buffer: we can buffer the points and see which ones intersect our lines. We could also have buffered the roads and counted the points inside.
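That second variant, buffering the lines instead of the points, can be sketched on a toy example (made-up geometries, not the post's data), assuming sf is installed:

```r
library(sf)

# Toy data: one "highway" and two "restaurants" (made-up coordinates)
highway <- sf::st_sfc(sf::st_linestring(rbind(c(0, 0), c(10, 0))))
restaurants <- sf::st_sfc(sf::st_point(c(5, 1)), sf::st_point(c(5, 50)))

# Buffer the line, then keep only the points falling inside the buffer
buffered <- sf::st_buffer(highway, dist = 2)
inside <- lengths(sf::st_intersects(restaurants, buffered)) > 0
inside # TRUE FALSE: only the first point is within 2 units of the line
```

With the real data this would amount to buffering highways2163.shp and subsetting restaurants2163.shp with it, instead of the other way around.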

timer <- Sys.time()
# step 1 buffer
buffered_restaurants <- sf::st_buffer(restaurants2163.shp, dist = dist_to_check) 
# step 2 buffered restaurant intersecting a highway
resto_near_road <- buffered_restaurants[highways2163.shp,]
Timing_result <- difftime(Sys.time(),  timer)

This took just a bit more than 6 seconds. What a difference!

We then just need to use some dplyr and we are done:

# this is just tidyverse go brrr!
resto_near_road %>% 
    sf::st_drop_geometry() %>% 
    dplyr::left_join(franchises.dat, 
                     by = c("franchise" = "id")) %>% 
    dplyr::group_by(franchise) %>% 
    dplyr::summarise(total = dplyr::n()) %>% 
    dplyr::arrange(dplyr::desc(total))

If we compare this result with the ones from PostGIS and from st_is_within_distance(), we find one more Wendy's restaurant with st_is_within_distance() and PostGIS than with the buffer approach.
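One way to pin down the extra restaurant is to compare the two sets of ids with setdiff(). A sketch with made-up ids:

```r
# Made-up ids standing in for the two results (not the real data)
ids_within_distance <- c(101L, 102L, 103L, 104L)
ids_buffer          <- c(101L, 102L, 103L)

# Restaurants found by st_is_within_distance() but not by the buffer approach
extra <- setdiff(ids_within_distance, ids_buffer)
extra # 104
```

With the real objects that would be setdiff(resto_near_roadv2$id, resto_near_road$id), assuming both results keep the id column from the CSV.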

A more accurate benchmark?

To produce a more accurate benchmark, we need to take a sample of this data set. Prototyping scripts on a smaller subset is also good practice. We will use the spData package to get the US states and only look at the fast-food restaurants and highways in Utah.

library(spData)

Utah <- us_states[us_states$NAME == "Utah",]
Utah <- sf::st_transform(Utah, 2163)

resto_utah.shp <- restaurants2163.shp[Utah,]
road_utah.shp <- highways2163.shp[Utah,]

Now we can use the microbenchmark package to get better indicators than our Sys.time() trick.

library(microbenchmark)

microbenchmark(
    buffer = {
        buffered_restaurants <- sf::st_buffer(resto_utah.shp, dist = dist_to_check) 
        resto_near_road <- buffered_restaurants[road_utah.shp,]
    },
    within_distance = {
        near_or_not <- lengths(sf::st_is_within_distance(resto_utah.shp,
                                                         road_utah.shp,
                                                         dist_to_check)) > 0
        resto_near_roadv2 <- resto_utah.shp[near_or_not,]
    },
    times = 100
)
            expr      min       lq     mean   median       uq       max neval
          buffer 118.7830 123.9558 132.3502 125.8452 132.0432  235.9454   100
 within_distance 618.5249 633.8456 672.2927 640.7561 668.1274 1149.8452   100

The buffer version is 5 times quicker! So if you just want to know whether some points are close enough to other objects, it is better to do it with a buffer. If you need, for every point, the indices of all the objects matching the predicate (much more information), you should use the st_is_within_distance() function.
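The difference in what the two approaches return can be seen on a small made-up example (assuming sf is installed): st_is_within_distance() gives, for each point, the indices of every matching object, while the lengths(...) > 0 trick collapses that into a yes/no answer.

```r
library(sf)

# Two points and two lines (made-up coordinates)
pts <- sf::st_sfc(sf::st_point(c(0, 0)), sf::st_point(c(100, 0)))
lns <- sf::st_sfc(sf::st_linestring(rbind(c(0, 1), c(10, 1))),
                  sf::st_linestring(rbind(c(0, 2), c(10, 2))))

# Sparse index list: for each point, which lines are within distance 5
hits <- sf::st_is_within_distance(pts, lns, dist = 5)
hits               # point 1 matches lines 1 and 2; point 2 matches none
lengths(hits) > 0  # TRUE FALSE: the cheaper yes/no version
```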

covid - review

@Robinlovelace some of my comments after looking at the blog post:

  • It would probably be good not to download the world data directly in the blog post. Maybe it is better to read it from a file (or similar) with some hidden code?
  • There are many code chunks (and their outputs) at the beginning of the blog post. Maybe some are not necessary?
  • Not all animations are visible in the blog post
  • I think the blog post needs some more text explaining the ideas and the code

Feel free to change whatever you think is necessary and publish the blog post.
