robinlovelace / spanishoddata

Work-in-progress package to get origin-destination (OD) data. We may move the repo at some point.

Home Page: https://robinlovelace.github.io/spanishoddata/

License: Other

Language: R (100.00%)

Topics: cdr, data, data-package, mobility, origin-destination, rstats, mobile-telephone-data

spanishoddata's Introduction

This repo demonstrates how to download and use OD data from Spain, published by transportes.gob.es.

The data is provided as follows:

  • Estudios basicos (basic studies)
    • Por distritos (by districts)
      • Personas (population)
      • Pernoctaciones (overnight stays)
      • Viajes (trips)
        • ficheros-diarios (daily files)
        • meses-completos (full months)

The package is designed to save people time by providing the data in analysis-ready formats. Automating the process of downloading, cleaning and importing the data can also reduce the risk of errors in the laborious process of data preparation.

The datasets are large, so the package aims to minimise computational requirements by using efficient packages behind the scenes. If you plan to use many of the data files, it is recommended to set a data directory where the package will look for the data, downloading only the files that are not already present.

Set the data directory via the SPANISH_OD_DATA_DIR environment variable, e.g. with the following command:

usethis::edit_r_environ()
# Then set the data directory by adding this line to the file:
SPANISH_OD_DATA_DIR = "/path/to/data"
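
After restarting R, you can confirm the variable is picked up (a quick check using base R, not package functionality):

Sys.getenv("SPANISH_OD_DATA_DIR")

The following setup code installs dependencies and loads the package: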
# Install dependencies:
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}
remotes::install_cran("duckdb")
# Load packages:
library(duckdb)
library(tidyverse)
theme_set(theme_minimal())
# Load the development version of the package:
devtools::load_all()
sf::sf_use_s2(FALSE)

Get metadata for the datasets as follows:

metadata = get_metadata()
metadata
# A tibble: 8,453 × 6
   target_url           pub_ts              file_extension data_ym data_ymd
   <chr>                <dttm>              <chr>          <date>  <date>  
 1 https://movilidad-o… 2024-05-16 09:00:45 tar            NA      NA      
 2 https://movilidad-o… 2024-05-16 08:58:20 tar            NA      NA      
 3 https://movilidad-o… 2024-05-16 08:58:18 tar            NA      NA      
 4 https://movilidad-o… 2024-05-16 08:57:24 tar            NA      NA      
 5 https://movilidad-o… 2024-05-16 08:55:49 tar            NA      NA      
 6 https://movilidad-o… 2024-05-16 08:55:47 tar            NA      NA      
 7 https://movilidad-o… 2024-05-16 08:55:10 tar            NA      NA      
 8 https://movilidad-o… 2024-05-16 08:54:13 tar            NA      NA      
 9 https://movilidad-o… 2024-05-16 08:54:11 tar            NA      NA      
10 https://movilidad-o… 2024-05-16 08:52:26 tar            NA      NA      
# ℹ 8,443 more rows
# ℹ 1 more variable: local_path <chr>

Zones

Zones can be downloaded as follows:

distritos = get_zones(type = "distritos")
distritos_wgs84 = sf::st_transform(distritos, 4326)
plot(distritos_wgs84)

Estudios basicos

The ficheros-diarios folder contains one file per day, with the following columns:

# set timeout for downloads
options(timeout = 600) # 10 minutes
u1 = "https://movilidad-opendata.mitma.es/estudios_basicos/por-distritos/viajes/ficheros-diarios/2024-03/20240301_Viajes_distritos.csv.gz"
f1 = basename(u1)
if (!file.exists(f1)) {
  download.file(u1, f1)
}
drv = duckdb::duckdb("daily.duckdb")
con = DBI::dbConnect(drv)
od1 = duckdb::tbl_file(con, f1)
# colnames(od1)
#  [1] "fecha"                   "periodo"                
#  [3] "origen"                  "destino"                
#  [5] "distancia"               "actividad_origen"       
#  [7] "actividad_destino"       "estudio_origen_posible" 
#  [9] "estudio_destino_posible" "residencia"             
# [11] "renta"                   "edad"                   
# [13] "sexo"                    "viajes"                 
# [15] "viajes_km"
od1_head = od1 |>
  head() |>
  collect()
od1_head |>
  knitr::kable()
| fecha | periodo | origen | destino | distancia | actividad_origen | actividad_destino | estudio_origen_posible | estudio_destino_posible | residencia | renta | edad | sexo | viajes | viajes_km |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20240301 | 19 | 01009_AM | 01001 | 0.5-2 | frecuente | casa | no | no | 01 | 10-15 | NA | NA | 5.124 | 6.120 |
| 20240301 | 15 | 01002 | 01001 | 10-50 | frecuente | casa | no | no | 01 | 10-15 | NA | NA | 2.360 | 100.036 |
| 20240301 | 00 | 01009_AM | 01001 | 10-50 | frecuente | casa | no | no | 01 | 10-15 | NA | NA | 1.743 | 22.293 |
| 20240301 | 05 | 01009_AM | 01001 | 10-50 | frecuente | casa | no | no | 01 | 10-15 | NA | NA | 2.404 | 24.659 |
| 20240301 | 06 | 01009_AM | 01001 | 10-50 | frecuente | casa | no | no | 01 | 10-15 | NA | NA | 5.124 | 80.118 |
| 20240301 | 09 | 01009_AM | 01001 | 10-50 | frecuente | casa | no | no | 01 | 10-15 | NA | NA | 7.019 | 93.938 |
DBI::dbDisconnect(con)

You can get the same result, but for multiple files, as follows:

od_multi_list = get_od(
  subdir = "estudios_basicos/por-distritos/viajes/ficheros-diarios",
  date_regex = "2024-03-0[1-7]"
)
od_multi_list[[1]]
# Source:   SQL [?? x 15]
# Database: DuckDB v0.10.2 [robin@Linux 6.5.0-35-generic:R 4.4.0/:memory:]
      fecha periodo origen  destino distancia actividad_origen actividad_destino
      <dbl> <chr>   <chr>   <chr>   <chr>     <chr>            <chr>            
 1 20240307 00      01009_… 01001   0.5-2     frecuente        casa             
 2 20240307 09      01009_… 01001   0.5-2     frecuente        casa             
 3 20240307 18      01009_… 01001   0.5-2     frecuente        casa             
 4 20240307 19      01009_… 01001   0.5-2     frecuente        casa             
 5 20240307 20      01009_… 01001   0.5-2     frecuente        casa             
 6 20240307 14      01002   01001   10-50     frecuente        casa             
 7 20240307 22      01002   01001   10-50     frecuente        casa             
 8 20240307 06      01009_… 01001   10-50     frecuente        casa             
 9 20240307 09      01009_… 01001   10-50     frecuente        casa             
10 20240307 11      01009_… 01001   10-50     frecuente        casa             
# ℹ more rows
# ℹ 8 more variables: estudio_origen_posible <chr>,
#   estudio_destino_posible <chr>, residencia <chr>, renta <chr>, edad <chr>,
#   sexo <chr>, viajes <dbl>, viajes_km <dbl>
class(od_multi_list[[1]])
[1] "tbl_duckdb_connection" "tbl_dbi"               "tbl_sql"              
[4] "tbl_lazy"              "tbl"                  

The result is a list of duckdb tables which load almost instantly and can be used with dplyr functions. Let’s do an aggregation to find the total number of trips per hour over the 7 days:

n_per_hour = od_multi_list |>
  map(~ .x |>
        group_by(periodo, fecha) |>
        summarise(n = n(), Trips = sum(viajes)) |>
        collect()
  ) |>
  list_rbind() |>
  mutate(Time = lubridate::ymd_h(paste0(fecha, periodo))) |>
  mutate(Day = lubridate::wday(Time, label = TRUE)) 
n_per_hour |>
  ggplot(aes(x = Time, y = Trips)) +
  geom_line(aes(colour = Day)) +
  labs(title = "Number of trips per hour over 7 days")

The figure above summarises 925,874,012 trips over the 7 days associated with 135,866,524 records.
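
Those totals can be recomputed from the n_per_hour object built above (a quick sanity check, not package functionality):

n_per_hour |>
  summarise(records = sum(n), trips = sum(Trips))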

Desire lines

We’ll use the same input data to pick out the most important flows in Spain, with a focus on longer trips for visualisation:

od_large = od_multi_list |>
  map(~ .x |>
        group_by(origen, destino) |>
        summarise(Trips = sum(viajes), .groups = "drop") |>
        filter(Trips > 500) |>
        collect()
  ) |>
  list_rbind() |>
  group_by(origen, destino) |>
  summarise(Trips = sum(Trips)) |>
  arrange(desc(Trips))
od_large
# A tibble: 37,023 × 3
# Groups:   origen [3,723]
   origen  destino    Trips
   <chr>   <chr>      <dbl>
 1 2807908 2807908 2441404.
 2 0801910 0801910 2112188.
 3 0801902 0801902 2013618.
 4 2807916 2807916 1821504.
 5 2807911 2807911 1785981.
 6 04902   04902   1690606.
 7 2807913 2807913 1504484.
 8 2807910 2807910 1299586.
 9 0704004 0704004 1287122.
10 28106   28106   1286058.
# ℹ 37,013 more rows

The results show that the largest flows are intra-zonal. Let’s keep only the inter-zonal flows:

od_large_interzonal = od_large |>
  filter(origen != destino)

We can convert these to geographic data with the {od} package:

od_large_interzonal_sf = od::od_to_sf(
  od_large_interzonal,
  z = distritos_wgs84
)
od_large_interzonal_sf |>
  ggplot() +
  geom_sf(aes(size = Trips), colour = "red") +
  theme_void()

Let’s focus on trips in and around a particular area (Salamanca):

salamanca_zones = zonebuilder::zb_zone("Salamanca")
distritos_salamanca = distritos_wgs84[salamanca_zones, ]
plot(distritos_salamanca)

We will use this information to subset the rows, to capture all movement within the study area:

ids_salamanca = distritos_salamanca$ID
od_salamanca = od_multi_list |>
  map(~ .x |>
        filter(origen %in% ids_salamanca) |>
        filter(destino %in% ids_salamanca) |>
        collect()
  ) |>
  list_rbind() |>
  group_by(origen, destino) |>
  summarise(Trips = sum(viajes)) |>
  arrange(Trips)

Let’s plot the results:

od_salamanca_sf = od::od_to_sf(
  od_salamanca,
  z = distritos_salamanca
)
od_salamanca_sf |>
  filter(origen != destino) |>
  ggplot() +
  geom_sf(aes(colour = Trips), size = 1) +
  scale_colour_viridis_c() +
  theme_void()

Disaggregating desire lines

For this you’ll need some additional dependencies:

library(sf)
library(tmap)

We’ll get the road network from OSM:

salamanca_boundary = sf::st_union(distritos_salamanca)
osm_full = osmactive::get_travel_network(salamanca_boundary)
osm = osm_full[salamanca_boundary, ]
drive_net = osmactive::get_driving_network(osm)
drive_net_major = osmactive::get_driving_network_major(osm)
cycle_net = osmactive::get_cycling_network(osm)
cycle_net = osmactive::distance_to_road(cycle_net, drive_net_major)
cycle_net = osmactive::classify_cycle_infrastructure(cycle_net)
map_net = osmactive::plot_osm_tmap(cycle_net)
map_net

We can use the road network to disaggregate the desire lines:

od_jittered = odjitter::jitter(
  od_salamanca_sf,
  zones = distritos_salamanca,
  subpoints = drive_net,
  disaggregation_threshold = 1000,
  disaggregation_key = "Trips"
)

Let’s plot the disaggregated desire lines:

od_jittered |>
  arrange(Trips) |>
  ggplot() +
  geom_sf(aes(colour = Trips), size = 1) +
  scale_colour_viridis_c() +
  geom_sf(data = drive_net_major, colour = "black") +
  theme_void()

spanishoddata's People

Contributors

e-kotov, robinlovelace


spanishoddata's Issues

The package should support both v1 (2020-2021) and v2 (2022 onwards) data

Here is a point to discuss.

Functions such as get_metadata (defined as get_metadata = function(data_dir = get_data_dir())), as well as the function that gets the XML, get_latest_v2_xml(), should support both v1 (2020-2021) data and v2 data (2022 onwards).

The datasets are not fully compatible, so the possible approaches are (and they are not mutually exclusive, I guess):

  1. Keep the function names distinct (like get_v2_metadata, get_v1_metadata, etc.). This way we nudge users to read the methodology for both datasets.
  2. Provide (and highlight in examples and documentation) a wrapper for all v1/v2 functions but issue a warning that the data are not compatible due to different methodology.

Overall, I suggest we focus on v2 data first, while keeping an eye on v1 and its differences from v2 (for example, in terms of available variables); accounting for its existence in the design of the v2 functions is essential.
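
A minimal sketch of what option 2 could look like (the version argument and the get_v1_metadata()/get_v2_metadata() names are assumptions for illustration, not the package's actual API):

get_metadata = function(version = 2, data_dir = get_data_dir()) {
  # v1 and v2 use different methodologies, so warn on every call
  warning("v1 (2020-2021) and v2 (2022 onwards) data are not directly comparable.")
  switch(
    as.character(version),
    "1" = get_v1_metadata(data_dir = data_dir),  # hypothetical v1 function
    "2" = get_v2_metadata(data_dir = data_dir),  # hypothetical v2 function
    stop("`version` must be 1 or 2")
  )
}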

Note: restructure downloaded CSVs into hive style for better performance

By default CSV files are structured as follows:

estudios_basicos/por-distritos/viajes/ficheros-diarios/2022-01/20220101_Viajes_distritos.csv.gz
estudios_basicos/por-distritos/viajes/ficheros-diarios/2022-01/20220102_Viajes_distritos.csv.gz

With this layout, both {duckdb} and {arrow} have to scan whole files for queries that involve a date filter, even though the data is already partitioned neatly into individual days.

Therefore, it is better to download data into a hive-style structure like so:

estudios_basicos/por-distritos/viajes/ficheros-diarios/year=2022/month=01/day=01/Viajes_distritos.csv.gz
estudios_basicos/por-distritos/viajes/ficheros-diarios/year=2022/month=01/day=02/Viajes_distritos.csv.gz

This way, even though we already have a full ISO date field inside the CSVs, both {duckdb} and {arrow} will be able to filter much faster using the columns generated from the hive-style file structure. The year, month and day columns can be dropped if not needed, and they take up virtually no additional space anyway.

{duckdb} seems to support hive_partitioning = true for read_csv.

{arrow} definitely supports hive-style for connecting to CSV folders with open_dataset().
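
A minimal sketch of what a query against such a layout could look like with {duckdb} (the glob path follows the hive-style example above; hive_partitioning support for CSV reads is assumed, per the note above):

con = DBI::dbConnect(duckdb::duckdb())
# year, month and day columns are derived from the directory names,
# so this filter can skip whole files rather than scanning them
DBI::dbGetQuery(con, "
  SELECT origen, destino, viajes
  FROM read_csv_auto(
    'estudios_basicos/por-distritos/viajes/ficheros-diarios/*/*/*/*.csv.gz',
    hive_partitioning = true
  )
  WHERE year = 2022 AND month = 1 AND day = 1
  LIMIT 5
")
DBI::dbDisconnect(con, shutdown = TRUE)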

Data package?

Heads-up @e-kotov and @eugenividal, I'm reviving this after failed attempts a few years ago.

Could it be worth making a small package to make this dataset easier to use, along the lines of stats19 and geobr?

Style?

At present the package uses both equals (=) and arrow (<-) assignment; we should standardise, e.g. by using styler or styler.equals.
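
A one-line way to apply a convention across the package (style_pkg() is from {styler}; its default tidyverse style converts = assignment to <-, while styler.equals is the variant for keeping =):

styler::style_pkg()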

Open data Spain: rOpenSpain

Nice package!

Maybe it would be interesting for this package to use other tools for analysing or plotting Spanish open data.
There is a GitHub organisation collecting such tools: @rOpenSpain; the mapSpain package might be useful.
There are other packages there that might complement yours.

If you want, you can relocate the repository there (you'll retain full permissions).
