robinlovelace / spanishoddata

Work-in-progress package to get origin-destination (OD) data. We may move the repo at some point.

Home Page: https://robinlovelace.github.io/spanishoddata/

License: Other

Language: R (100.00%)

Topics: cdr, data, data-package, mobility, origin-destination, rstats, mobile-telephone-data

spanishoddata's Introduction

This repo demonstrates how to download and use OD data from Spain, published by transportes.gob.es.

The data is provided as follows:

  • Estudios basicos (basic studies)
    • Por distritos (by districts)
      • Personas (population)
      • Pernoctaciones (overnight stays)
      • Viajes (trips)
        • ficheros-diarios (daily files)
        • meses-completos (full months)

The package is designed to save people time by providing the data in analysis-ready formats. Automating the process of downloading, cleaning and importing the data can also reduce the risk of errors in the laborious process of data preparation.

The datasets are large, so the package aims to minimise computational requirements by using efficient packages behind the scenes. If you plan to use many of the data files, it is recommended to set a data directory where the package will look for the data, downloading only the files that are not already present.

Set the data directory via the SPANISH_OD_DATA_DIR environment variable, e.g. with the following command:

usethis::edit_r_environ()
# Then set the data directory by adding this line to the file:
SPANISH_OD_DATA_DIR = "/path/to/data"
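
After restarting R, you can confirm the variable is picked up (a quick check using base R, not package functionality):

Sys.getenv("SPANISH_OD_DATA_DIR")

The following setup code installs dependencies and loads the package: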
# Install dependencies:
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}
remotes::install_cran("duckdb")
# Load packages:
library(duckdb)
library(tidyverse)
theme_set(theme_minimal())
# Load the development version of the package:
devtools::load_all()
sf::sf_use_s2(FALSE)

Get metadata for the datasets as follows:

metadata = get_metadata()
metadata
# A tibble: 8,453 × 6
   target_url           pub_ts              file_extension data_ym data_ymd
   <chr>                <dttm>              <chr>          <date>  <date>  
 1 https://movilidad-o… 2024-05-16 09:00:45 tar            NA      NA      
 2 https://movilidad-o… 2024-05-16 08:58:20 tar            NA      NA      
 3 https://movilidad-o… 2024-05-16 08:58:18 tar            NA      NA      
 4 https://movilidad-o… 2024-05-16 08:57:24 tar            NA      NA      
 5 https://movilidad-o… 2024-05-16 08:55:49 tar            NA      NA      
 6 https://movilidad-o… 2024-05-16 08:55:47 tar            NA      NA      
 7 https://movilidad-o… 2024-05-16 08:55:10 tar            NA      NA      
 8 https://movilidad-o… 2024-05-16 08:54:13 tar            NA      NA      
 9 https://movilidad-o… 2024-05-16 08:54:11 tar            NA      NA      
10 https://movilidad-o… 2024-05-16 08:52:26 tar            NA      NA      
# ℹ 8,443 more rows
# ℹ 1 more variable: local_path <chr>

Zones

Zones can be downloaded as follows:

distritos = get_zones(type = "distritos")
distritos_wgs84 = sf::st_transform(distritos, 4326)
plot(distritos_wgs84)

Estudios basicos

The ficheros-diarios folder contains one file per day, with the following columns:

# set timeout for downloads
options(timeout = 600) # 10 minutes
u1 = "https://movilidad-opendata.mitma.es/estudios_basicos/por-distritos/viajes/ficheros-diarios/2024-03/20240301_Viajes_distritos.csv.gz"
f1 = basename(u1)
if (!file.exists(f1)) {
  download.file(u1, f1)
}
drv = duckdb::duckdb("daily.duckdb")
con = DBI::dbConnect(drv)
od1 = duckdb::tbl_file(con, f1)
# colnames(od1)
#  [1] "fecha"                   "periodo"                
#  [3] "origen"                  "destino"                
#  [5] "distancia"               "actividad_origen"       
#  [7] "actividad_destino"       "estudio_origen_posible" 
#  [9] "estudio_destino_posible" "residencia"             
# [11] "renta"                   "edad"                   
# [13] "sexo"                    "viajes"                 
# [15] "viajes_km"
od1_head = od1 |>
  head() |>
  collect()
od1_head |>
  knitr::kable()
| fecha | periodo | origen | destino | distancia | actividad_origen | actividad_destino | estudio_origen_posible | estudio_destino_posible | residencia | renta | edad | sexo | viajes | viajes_km |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20240301 | 19 | 01009_AM | 01001 | 0.5-2 | frecuente | casa | no | no | 01 | 10-15 | NA | NA | 5.124 | 6.120 |
| 20240301 | 15 | 01002 | 01001 | 10-50 | frecuente | casa | no | no | 01 | 10-15 | NA | NA | 2.360 | 100.036 |
| 20240301 | 00 | 01009_AM | 01001 | 10-50 | frecuente | casa | no | no | 01 | 10-15 | NA | NA | 1.743 | 22.293 |
| 20240301 | 05 | 01009_AM | 01001 | 10-50 | frecuente | casa | no | no | 01 | 10-15 | NA | NA | 2.404 | 24.659 |
| 20240301 | 06 | 01009_AM | 01001 | 10-50 | frecuente | casa | no | no | 01 | 10-15 | NA | NA | 5.124 | 80.118 |
| 20240301 | 09 | 01009_AM | 01001 | 10-50 | frecuente | casa | no | no | 01 | 10-15 | NA | NA | 7.019 | 93.938 |
DBI::dbDisconnect(con)

You can get the same result, but for multiple files, as follows:

od_multi_list = get_od(
  subdir = "estudios_basicos/por-distritos/viajes/ficheros-diarios",
  date_regex = "2024-03-0[1-7]"
)
od_multi_list[[1]]
# Source:   SQL [?? x 15]
# Database: DuckDB v0.10.2 [robin@Linux 6.5.0-35-generic:R 4.4.0/:memory:]
      fecha periodo origen  destino distancia actividad_origen actividad_destino
      <dbl> <chr>   <chr>   <chr>   <chr>     <chr>            <chr>            
 1 20240307 00      01009_… 01001   0.5-2     frecuente        casa             
 2 20240307 09      01009_… 01001   0.5-2     frecuente        casa             
 3 20240307 18      01009_… 01001   0.5-2     frecuente        casa             
 4 20240307 19      01009_… 01001   0.5-2     frecuente        casa             
 5 20240307 20      01009_… 01001   0.5-2     frecuente        casa             
 6 20240307 14      01002   01001   10-50     frecuente        casa             
 7 20240307 22      01002   01001   10-50     frecuente        casa             
 8 20240307 06      01009_… 01001   10-50     frecuente        casa             
 9 20240307 09      01009_… 01001   10-50     frecuente        casa             
10 20240307 11      01009_… 01001   10-50     frecuente        casa             
# ℹ more rows
# ℹ 8 more variables: estudio_origen_posible <chr>,
#   estudio_destino_posible <chr>, residencia <chr>, renta <chr>, edad <chr>,
#   sexo <chr>, viajes <dbl>, viajes_km <dbl>
class(od_multi_list[[1]])
[1] "tbl_duckdb_connection" "tbl_dbi"               "tbl_sql"              
[4] "tbl_lazy"              "tbl"                  

The result is a list of duckdb tables which load almost instantly and can be used with dplyr functions. Let’s do an aggregation to find the total number of trips per hour over the 7 days:

n_per_hour = od_multi_list |>
  map(~ .x |>
        group_by(periodo, fecha) |>
        summarise(n = n(), Trips = sum(viajes)) |>
        collect()
  ) |>
  list_rbind() |>
  mutate(Time = lubridate::ymd_h(paste0(fecha, periodo))) |>
  mutate(Day = lubridate::wday(Time, label = TRUE)) 
n_per_hour |>
  ggplot(aes(x = Time, y = Trips)) +
  geom_line(aes(colour = Day)) +
  labs(title = "Number of trips per hour over 7 days")

The figure above summarises 925,874,012 trips over the 7 days associated with 135,866,524 records.
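
Those totals can be recomputed from the n_per_hour object built above (a quick sanity check, not package functionality):

n_per_hour |>
  summarise(records = sum(n), trips = sum(Trips))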

Desire lines

We’ll use the same input data to pick out the most important flows in Spain, with a focus on longer trips for visualisation:

od_large = od_multi_list |>
  map(~ .x |>
        group_by(origen, destino) |>
        summarise(Trips = sum(viajes), .groups = "drop") |>
        filter(Trips > 500) |>
        collect()
  ) |>
  list_rbind() |>
  group_by(origen, destino) |>
  summarise(Trips = sum(Trips)) |>
  arrange(desc(Trips))
od_large
# A tibble: 37,023 × 3
# Groups:   origen [3,723]
   origen  destino    Trips
   <chr>   <chr>      <dbl>
 1 2807908 2807908 2441404.
 2 0801910 0801910 2112188.
 3 0801902 0801902 2013618.
 4 2807916 2807916 1821504.
 5 2807911 2807911 1785981.
 6 04902   04902   1690606.
 7 2807913 2807913 1504484.
 8 2807910 2807910 1299586.
 9 0704004 0704004 1287122.
10 28106   28106   1286058.
# ℹ 37,013 more rows

The results show that the largest flows are intra-zonal. Let’s keep only the inter-zonal flows:

od_large_interzonal = od_large |>
  filter(origen != destino)

We can convert these to geographic data with the {od} package:

od_large_interzonal_sf = od::od_to_sf(
  od_large_interzonal,
  z = distritos_wgs84
)
od_large_interzonal_sf |>
  ggplot() +
  geom_sf(aes(size = Trips), colour = "red") +
  theme_void()

Let’s focus on trips in and around a particular area (Salamanca):

salamanca_zones = zonebuilder::zb_zone("Salamanca")
distritos_salamanca = distritos_wgs84[salamanca_zones, ]
plot(distritos_salamanca)

We will use this information to subset the rows, to capture all movement within the study area:

ids_salamanca = distritos_salamanca$ID
od_salamanca = od_multi_list |>
  map(~ .x |>
        filter(origen %in% ids_salamanca) |>
        filter(destino %in% ids_salamanca) |>
        collect()
  ) |>
  list_rbind() |>
  group_by(origen, destino) |>
  summarise(Trips = sum(viajes)) |>
  arrange(Trips)

Let’s plot the results:

od_salamanca_sf = od::od_to_sf(
  od_salamanca,
  z = distritos_salamanca
)
od_salamanca_sf |>
  filter(origen != destino) |>
  ggplot() +
  geom_sf(aes(colour = Trips), size = 1) +
  scale_colour_viridis_c() +
  theme_void()

Disaggregating desire lines

For this you’ll need some additional dependencies:

library(sf)
library(tmap)

We’ll get the road network from OSM:

salamanca_boundary = sf::st_union(distritos_salamanca)
osm_full = osmactive::get_travel_network(salamanca_boundary)
osm = osm_full[salamanca_boundary, ]
drive_net = osmactive::get_driving_network(osm)
drive_net_major = osmactive::get_driving_network_major(osm)
cycle_net = osmactive::get_cycling_network(osm)
cycle_net = osmactive::distance_to_road(cycle_net, drive_net_major)
cycle_net = osmactive::classify_cycle_infrastructure(cycle_net)
map_net = osmactive::plot_osm_tmap(cycle_net)
map_net

We can use the road network to disaggregate the desire lines:

od_jittered = odjitter::jitter(
  od_salamanca_sf,
  zones = distritos_salamanca,
  subpoints = drive_net,
  disaggregation_threshold = 1000,
  disaggregation_key = "Trips"
)

Let’s plot the disaggregated desire lines:

od_jittered |>
  arrange(Trips) |>
  ggplot() +
  geom_sf(aes(colour = Trips), size = 1) +
  scale_colour_viridis_c() +
  geom_sf(data = drive_net_major, colour = "black") +
  theme_void()

spanishoddata's People

Contributors

e-kotov, robinlovelace


spanishoddata's Issues

The package should support both v1 (2020-2021) and v2 (2022 onwards) data

Here is a point to discuss.

Functions such as get_metadata (defined as get_metadata = function(data_dir = get_data_dir())), as well as the function that gets the XML, get_latest_v2_xml(), should support both v1 (2020-2021) data and v2 data (2022 onwards).

The datasets are not fully compatible, so the possible approaches are (and they are not mutually exclusive, I guess):

  1. Keep the function names distinct (like get_v2_metadata, get_v1_metadata, etc.). This way we nudge users to read the methodology for both datasets.
  2. Provide (and highlight in examples and documentation) a wrapper for all v1/v2 functions but issue a warning that the data are not compatible due to different methodology.

Overall, I suggest we focus on v2 data first, while keeping an eye on v1 and its differences from v2 (for example, in terms of available variables); accounting for its existence in the design of the v2 functions is essential.
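
A minimal sketch of what option 2 could look like (the version argument and the get_v1_metadata()/get_v2_metadata() names are assumptions for illustration, not the package's actual API):

get_metadata = function(version = 2, data_dir = get_data_dir()) {
  # v1 and v2 use different methodologies, so warn on every call
  warning("v1 (2020-2021) and v2 (2022 onwards) data are not directly comparable.")
  switch(
    as.character(version),
    "1" = get_v1_metadata(data_dir = data_dir),  # hypothetical v1 function
    "2" = get_v2_metadata(data_dir = data_dir),  # hypothetical v2 function
    stop("`version` must be 1 or 2")
  )
}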

Note: restructure downloaded CSVs into hive style for better performance

By default CSV files are structured as follows:

estudios_basicos/por-distritos/viajes/ficheros-diarios/2022-01/20220101_Viajes_distritos.csv.gz
estudios_basicos/por-distritos/viajes/ficheros-diarios/2022-01/20220102_Viajes_distritos.csv.gz

With this layout, both {duckdb} and {arrow} have to scan whole files for queries that involve a date filter, even though the data is already partitioned neatly into individual days.

Therefore, it is better to download data into a hive-style structure like so:

estudios_basicos/por-distritos/viajes/ficheros-diarios/year=2022/month=01/day=01/Viajes_distritos.csv.gz
estudios_basicos/por-distritos/viajes/ficheros-diarios/year=2022/month=01/day=02/Viajes_distritos.csv.gz

This way, even though we already have a full ISO date field inside the CSVs, both {duckdb} and {arrow} will be able to filter much faster using the columns generated from the hive-style file structure. The year, month and day columns can be dropped if not needed, and they take up virtually no additional space anyway.

{duckdb} seems to support hive_partitioning = true for read_csv.

{arrow} definitely supports hive-style for connecting to CSV folders with open_dataset().
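
A minimal sketch of what a query against such a layout could look like with {duckdb} (the glob path follows the hive-style example above; hive_partitioning support for CSV reads is assumed, per the note above):

con = DBI::dbConnect(duckdb::duckdb())
# year, month and day columns are derived from the directory names,
# so this filter can skip whole files rather than scanning them
DBI::dbGetQuery(con, "
  SELECT origen, destino, viajes
  FROM read_csv_auto(
    'estudios_basicos/por-distritos/viajes/ficheros-diarios/*/*/*/*.csv.gz',
    hive_partitioning = true
  )
  WHERE year = 2022 AND month = 1 AND day = 1
  LIMIT 5
")
DBI::dbDisconnect(con, shutdown = TRUE)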

Data package?

Heads-up @e-kotov and @eugenividal, I'm reviving this after failed attempts a few years ago.

Could it be worth making a small package to make this dataset easier to use, along the lines of stats19 and geobr?

Style?

At present the package uses both equals (=) and arrow (<-) assignment; we should standardise, e.g. by using styler or styler.equals.
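
A one-line way to apply a convention across the package (style_pkg() is from {styler}; its default tidyverse style converts = assignment to <-, while styler.equals is the variant for keeping =):

styler::style_pkg()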

Open data Spain: rOpenSpain

Nice package!

Maybe it would be interesting for this package to use other tools for analysing or plotting Spanish open data.
There is a GitHub organisation collecting such tools: @rOpenSpain; the mapSpain package might be useful.
There are other packages there that might complement yours.

If you want, you can relocate the repository there (you'll retain full permissions).
