mlr-org / mlr3spatiotempcv

Spatiotemporal resampling methods for mlr3

Home Page: https://mlr3spatiotempcv.mlr-org.com

License: GNU Lesser General Public License v3.0

R 29.57% TeX 70.43%

mlr3 resampling spatial cross-validation temporal resampling-methods r r-package

mlr3spatiotempcv's Introduction

mlr3spatiotempcv

Package website: release | dev

Spatiotemporal resampling methods for mlr3.

This package extends the mlr3 package framework with spatiotemporal resampling and visualization methods.

If you prefer the tidymodels ecosystem, have a look at the {spatialsample} package for spatial sampling methods.

Installation

CRAN version

install.packages("mlr3spatiotempcv")

Development version

remotes::install_github("mlr-org/mlr3spatiotempcv")

# R Universe Repo
install.packages('mlr3spatiotempcv', repos = c(mlrorg = 'https://mlr-org.r-universe.dev', CRAN = 'https://cloud.r-project.org'))

Get Started

See the "Get Started" vignette for a quick introduction.

For more detailed information, including a usage example, see the "Spatiotemporal Analysis" chapter in the mlr3book.

The article "Spatiotemporal Visualization" shows how 3D subplot grids can be created.

Citation

To cite the package in publications, use the output of citation("mlr3spatiotempcv").

Resources

Other spatiotemporal resampling packages

This list does not claim to be comprehensive.

(Disclaimer: Because CRAN does not like DOI URLs in their automated checks, direct linking to scientific articles is not possible...)

Name	Language	Resources
blockCV	R	CRAN
CAST	R	Paper, CRAN
ENMeval	R	CRAN
spatialsample	R	CRAN
sperrorest	R	CRAN
Pyspatialml	Python	GitHub
spacv	Python	GitHub
Museo Toolbox	Python	Paper, GitHub
spatial-kfold	Python	GitHub

FAQ

Which resampling method should I use?

There is no single best resampling method. The choice depends on the characteristics of your dataset and on what the model is supposed to predict. The resampling scheme should reflect the final purpose of the model; this concept is called "target-oriented" resampling. For example, if the model was trained on multiple forest plots and its purpose is to predict on unknown forest stands, the resampling structure should reflect this by holding out whole plots rather than individual observations.
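
As a minimal sketch of the forest-plot example, base R can build folds that hold out whole plots. The data frame and the `plot_id` column are invented for illustration, and this is not how the package implements any of its methods.

```r
# Hypothetical data: 12 observations from 4 forest plots
obs <- data.frame(
  plot_id = rep(c("A", "B", "C", "D"), each = 3),
  y       = seq_len(12)
)

# One fold per plot: all observations of the held-out plot form the test set
folds <- lapply(unique(obs$plot_id), function(p) {
  list(
    train = which(obs$plot_id != p),
    test  = which(obs$plot_id == p)
  )
})

# Check that no plot appears in both the train and the test set of a fold
leak_free <- all(vapply(folds, function(f) {
  length(intersect(obs$plot_id[f$train], obs$plot_id[f$test])) == 0
}, logical(1)))
```

With this structure, each performance estimate comes from a plot the model has never seen, which matches the stated purpose of predicting on unknown stands.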

Are there more resampling methods than the ones {mlr3spatiotempcv} offers?

{mlr3spatiotempcv} aims to offer all resampling methods that exist in R. This does not mean that it already covers all of them. If you are missing one, feel free to open an issue.

How can I use the "blocking" concept of the old {mlr}?

This concept is now supported via the "column roles" concept available in {mlr3} [Task](https://mlr3.mlr-org.com/reference/Task.html) objects. See [this documentation](https://mlr3.mlr-org.com/reference/Resampling.html#grouping-blocking) for more information.
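
A minimal sketch of grouping via column roles, assuming a recent {mlr3} version is installed. The data and the `plot_id` column are hypothetical; `as_task_regr()`, `set_col_roles()` and `rsmp("cv")` are the {mlr3} calls assumed here.

```r
library(mlr3)

set.seed(1)
# Hypothetical data: 30 observations in 6 groups
df <- data.frame(
  y       = rnorm(30),
  x1      = rnorm(30),
  plot_id = rep(letters[1:6], each = 5)
)

task <- as_task_regr(df, target = "y", id = "plots")

# Use plot_id as grouping ("blocking") variable instead of as a feature
task$set_col_roles("plot_id", roles = "group")

cv <- rsmp("cv", folds = 3)
cv$instantiate(task)

# Count groups that occur in both train and test set of any fold
overlap <- vapply(seq_len(3), function(i) {
  length(intersect(df$plot_id[cv$train_set(i)], df$plot_id[cv$test_set(i)]))
}, integer(1))
```

If grouping works as documented, `overlap` contains only zeros, because all observations of a group are kept together in either the train or the test set.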

For the methods that offer buffering, how can an appropriate value be chosen?

There is no easy answer to this question. Buffering train and test sets reduces the similarity between both. The degree of this reduction depends on the dataset itself, and there is no general approach for choosing an appropriate buffer size. Some studies used the distance at which the autocorrelation levels off. This buffer distance often removes quite a lot of observations and needs to be calculated first.
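
To illustrate how a buffer removes observations, here is a base R sketch of leave-one-out folds with a distance buffer. The coordinates and the buffer value of 150 are assumptions for illustration; in practice the distance might be derived from an autocorrelation range.

```r
set.seed(42)
# Hypothetical point coordinates in projected units (e.g. metres)
coords <- cbind(x = runif(50, 0, 1000), y = runif(50, 0, 1000))
buffer <- 150  # assumed buffer distance

d <- as.matrix(dist(coords))  # pairwise Euclidean distances

# Leave-one-out with buffer: point i is the test set, every point within
# `buffer` of it is dropped, and the remaining points form the training set
buffered_folds <- lapply(seq_len(nrow(coords)), function(i) {
  list(test = i, train = unname(which(d[i, ] > buffer)))
})

# Share of observations omitted per fold (neither train nor test)
omitted <- vapply(buffered_folds, function(f) {
  (nrow(coords) - length(f$train) - 1) / nrow(coords)
}, numeric(1))
```

Inspecting `summary(omitted)` for a few candidate distances shows how quickly a larger buffer shrinks the training sets.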

mlr3spatiotempcv's Issues

Support for sf package

I was wondering if there was any planned support for sf dataframes to be used as task inputs? It may remove some of the arguments a user needs to provide when creating a task. A few useful functions that can be used to define some of the spatial inputs include:

# First, load some point data and sf library
data(meuse, package = "sp") # load data.frame from sp
library(sf)
x <- sf::st_as_sf(meuse, coords = c("x", "y"), crs = 28992)

# Generate the coordinate columns
sf::st_coordinates(x) 

# Find the names of the coordinate columns
attr(x, "sf_column")

# Extract CRS information of the sf dataframe
sf::st_crs(x)
sf::st_crs(x)$epsg # Gets numeric EPSG code
sf::st_crs(x)$wkt # Gets WKT string

# Ensure that the geometry type is point
sf::st_geometry_type(x, by_geometry = FALSE) # or
all(sf::st_is(x, "POINT"))

# Remove geometry list column for data backend
x_df <- sf::st_drop_geometry(x)

# Keep coordinates as features in the data
cbind(x_df, sf::st_coordinates(x))

No worries if there is no planned support, I'm just curious and offering some solutions just in case! I use the sf package quite a bit in my line of work for extracting raster covariate data, and have used these functions repeatedly. It's just a suggestion for more ease on the user end - rather than providing a dataframe with the x and y coordinates and having to specify the CRS and coordinate names from the dataframe, a user could simply provide the sf dataframe and specify whether to use coordinates as features.

Thanks for all the hard work on this package! I use it frequently and it works really well!

Coordinates in the mlr3 Task object

Hi, I am trying to generate a task with my own data as shown in some tutorials for mlr. I created some random data and have a data.frame with coordinates. When I call TaskRegr$new( .... , coordinates = coords), it results in:

Error in .subset2(public_bind_env, "initialize")(...) :
unused argument (coordinates = coords)

Adding coordinates manually to the task does not work either. It would be great to have a full example here, starting from raw data, of a regression problem where one wants to predict y from X and has lat/lon coordinates to consider in the CV.

Add option to construct repeated resamplings

Similar to mlr3::ResamplingRepeatedCV.

spcv-coords
spcv-env
spcv-block

spcv-buffer is LOOCV and has no repeats.

cc @jannes-m

Add `autoplot()` for non-spatial rsmp

Add 2D plot method for Cstf method

If only space_var is used, a 2D ggplot can be created similar to other spatial-only methods.

This method should also show the omitted points optionally.

Spatial CV failed with mlr3tuning

I want to evaluate the performance of hyperparameters on a spatial dataset within spatial CV. Unfortunately, while non-spatial CV or bootstrap work in the mlr3tuning instance, spcv-coords and repeated-spcv-coords produce the error:

Error in benchmark_grid(self$task, learners, self$resampling) (mlr3_issue.R#43): Resampling is instantiated for a task with a different number of observations

I'm not sure whether the error occurred in my code (or in the idea of the spatial cross-validating hyperparameters) or the feature is not provided by mlr3spatiotempcv. However even if I instantiate the resampling method to the task, the tuning instance reproduces the error. (tuning_resampling$instantiate(task))
A reproducible example from the ecuador dataset:

library("mlr3")
library("mlr3spatiotempcv")
library("mlr3tuning")
library("paradox")

task = tsk("ecuador")
learner = lrn("classif.rpart", predict_type = "prob")

# tune hyperparameter cp
param_set = ps(cp = p_dbl(lower = -5, upper = 0, trafo = function(x) 10^x))

# AUROC suitable for binary classification tasks
measure = msr("classif.auc")

# 10 evaluations
terminator = trm("evals", n_evals = 10)

# random search: best balance between computation time and search space grazing
tuner = tnr("random_search")

# inner resampling method
tuning_resampling = rsmp("spcv-coords", folds = 10)
# tuning_resampling$instantiate(task)

instance = TuningInstanceSingleCrit$new(
    task = task,
    learner = learner,
    resampling = tuning_resampling,
    measure = measure,
    search_space = param_set,
    terminator = terminator
)

tuner$optimize(instance)

#' Error in benchmark_grid(self$task, learners, self$resampling) (mlr3_issue.R#43):
#' Resampling is instantiated for a task with a different number of observations

Need for gap filling approach within mlr3spatiotempcv for time-series data

Hi,

Can we have a gap filling approach within mlr3spatiotempcv?

Here are some examples of the problem and solutions for your reference:
LINK1
LINK2
LINK3

Resampling: Implement "Spatial blocking"

From package blockCV: https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.13107

Check whether CLUTO operates in the feature space or spatiotemporal space

Cleanup `autoplot()`

Move redundant parts into helper funs
Consider using separate help files?

Check re-enabling of vdiffr tests on CI

`autoplot()`: Make point sizes of `geom_sf` configurable

By passing down the ellipsis args of autoplot().

Group pkgdown reference index

Resampling: Implement "Buffering"

From package blockCV: https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.13107

make spatiotempcv compatible with other packages

Currently mlr3spatiotempcv overwrites col_roles and row_roles in zzz.R.
This makes it non-compatible with other packages, i.e. if I load a package that adds a different set of col_roles, those are overwritten by mlr3spatiotempcv. We should append instead of overwriting here.
The fact that the task_type is non-unique after loading mlr3spatiotempcv leads to tiny problems in mlr3pipelines. We should discuss at a higher level how we expect packages in the mlr3verse to behave here, as it is not completely clear how things should work here.
We will find a workaround in mlr3pipelines for now.
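
The "append instead of overwrite" idea can be sketched in base R. The `reflections` list below is a hypothetical stand-in for a shared registry of column roles, and the role names are chosen for illustration only; it does not reproduce the actual code in zzz.R.

```r
# Hypothetical stand-in for a shared registry of column roles per task type;
# "other_pkg_role" represents a role registered earlier by another package
reflections <- list(task_col_roles = list(
  regr    = c("feature", "target", "group", "other_pkg_role"),
  classif = c("feature", "target", "group")
))

# Roles this package wants to register (names are illustrative)
new_roles <- c("coordinates", "group")

# Append with union() so that roles added by other packages survive
# and roles that already exist are not duplicated
reflections$task_col_roles <- lapply(
  reflections$task_col_roles,
  function(existing) union(existing, new_roles)
)
```

Because `union()` keeps the existing entries and only adds missing ones, the result does not depend on the order in which packages are loaded.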

Resampling: Add spatial methods

Prerequisite for spatial stuff: We need a place to store the coordinates in the task. This needs to be enabled in mlr3.

As mentioned here my vision is to make mlr3 THE place for spatial/spatio-temporal resampling methods (there are > 5 methods).

Add TaskRegrST

Add more introductory information about autocorrelation

including formulas
explain why custom resampling methods are needed and important in the first place

Add a spatial regr example task

TaskClassifST fails with ordered factor

Hi Patrick,

Is there any reason why you restrict the class of the target column to either factor or character?
I have an ordered factor as response, which fails in task creation with TaskClassifST$new(),
but it works with TaskClassif$new()

mlr3spatiotempcv/R/TaskClassifST.R

Line 70 in 582d6f0

if (info$type %nin% c("factor", "character")) {

See also this issue, where the question was raised for mlr3:
mlr-org/mlr3#95

Here's a little reprex:

library(mlr3verse)
#> Loading required package: mlr3
#> Loading required package: mlr3filters
#> Loading required package: mlr3learners
#> Loading required package: mlr3pipelines
#> Loading required package: mlr3tuning
#> Loading required package: mlr3viz
#> Loading required package: paradox

# remotes::install_github("mlr-org/mlr3spatiotempcv")
library(mlr3spatiotempcv)

brew <- mapview::breweries
brew$number.seasonal.beers <- factor(brew$number.seasonal.beers, ordered = TRUE)

brew <- cbind(sf::st_drop_geometry(brew), 
              sf::st_coordinates(brew))

# task 
task_nsb_o = TaskClassif$new(
  id = "nsb",
  backend = brew, 
  target = "number.seasonal.beers")

task_nsb_o
#> <TaskClassif:nsb> (224 x 10)
#> * Target: number.seasonal.beers
#> * Properties: multiclass
#> * Features (9):
#>   - dbl (4): X, Y, founded, number.of.types
#>   - chr (4): address, brewery, village, zipcode
#>   - fct (1): state

# task ST - ordered
task_nsb_ST_o = TaskClassifST$new(
  id = "nsbST_o",
  backend = brew, 
  target = "number.seasonal.beers", 
  extra_args = list(
    coordinate_names = c("X", "Y"),
    coords_as_features = FALSE,
    crs = "4326"))
#> Error: Target column 'number.seasonal.beers' must be a factor or character

# task ST - factor
brew$number.seasonal.beers <- factor(brew$number.seasonal.beers, ordered = FALSE)
task_nsb_ST_f = TaskClassifST$new(
  id = "nsbST_f",
  backend = brew, 
  target = "number.seasonal.beers", 
  extra_args = list(
    coordinate_names = c("X", "Y"),
    coords_as_features = FALSE,
    crs = "4326"))

task_nsb_ST_f
#> <TaskClassifST:nsbST_f> (224 x 8)
#> * Target: number.seasonal.beers
#> * Properties: multiclass
#> * Features (7):
#>   - chr (4): address, brewery, village, zipcode
#>   - dbl (2): founded, number.of.types
#>   - fct (1): state
#> * Coordinates:
#>             X        Y
#>   1: 10.88922 49.71979
#>   2: 11.23873 50.12579
#>   3: 10.85194 49.42080
#>   4: 10.07837 50.16197
#>   5:  9.97323 49.97720
#>  ---                  
#> 220: 10.93073 50.12684
#> 221: 11.54562 50.07220
#> 222: 11.50372 50.01548
#> 223: 11.55831 49.98518
#> 224: 11.07389 50.06172

Created on 2020-11-09 by the reprex package (v0.3.0)

Visualization: Function to visualize resampling splits

Should work with all implemented resampling methods.

Resampling: Implement "Environmental blocking"

From package blockCV: https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.13107

Resampling: k-means clustering

Implement k-means clustering after Brenning2012 (https://ieeexplore.ieee.org/document/6352393)
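
The general idea of partitioning by k-means clustering of the coordinates can be sketched in base R with `stats::kmeans()`. The coordinates and the number of folds are assumptions for illustration, and this is not the package's implementation.

```r
set.seed(1)
# Hypothetical coordinates for 100 observations
coords <- data.frame(x = runif(100), y = runif(100))
k <- 5  # assumed number of folds

# Cluster the coordinates; each cluster becomes the test set of one fold
cl <- stats::kmeans(coords, centers = k)
fold_id <- cl$cluster

folds <- lapply(seq_len(k), function(i) {
  list(train = which(fold_id != i), test = which(fold_id == i))
})
```

Unlike random CV, the test observations of each fold are spatially contiguous, so fold sizes are generally unequal.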

More descriptive / informative error messages for wrong task

Currently, the error messages shown when using a wrong task with a spatial resampling method do not really help with finding the error; they could be more descriptive.

library(mlr3)
library(mlr3spatiotempcv)
rsmp("spcv_coords")$instantiate(tsk("boston_housing"))
#> Error in as.matrix(x): attempt to apply non-function

Created on 2020-09-14 by the reprex package (v0.3.0)

Move arguments of `instantiate()` to constructor?

The problem is that the user does not call instantiate() actively in a Graph Learner. Hence, required arguments like time_var for SptCV methods cannot be passed along.

They need to be populated from a field which is set during construction.
If we do so, we possibly can omit the need of arguments in instantiate() completely?

Explore a three-dimensional clustering method

To cluster in x, y and time.

Package skmeans looks promising.
Among other methods, it comes with an interface to CLUTO, which seems to specifically address high-dimensional clustering in GIS applications.

Support presence-background option in "Spatial Buffer CV"

If the target has a binary outcome, a presence-background approach (see blockCV::buffering) would be possible. Target needs to be transformed to 0/1 before sampling.

Resampling methods: Replace hyphens by underscores

To match the naming scheme of mlr3 which uses underscores (e.g. repeated_cv).

Cstf method references wrong method

RepeatedResamplingCstf accidentally references the Cluto method.

Add `ResamplingRepeatedSptCVCstf`

Model tuning does not work with mlr3spatiotempcv package

I've been at this issue for a while now and I figured I should report this and give some examples. In previous versions of mlr3 and associated packages, I was able to perform the following task:

Perform feature filtering on a dataset using variable importance filters (i.e.: tuning)
Construct a repeated spatial cross validation model using the filtered dataset
Select the best filtered model

I was attempting to carry that out again this week but I've hit quite the roadblock - it appears that tuning no longer plays nicely with the mlr3spatiotempcv package! Here is a reproducible example:

library(mlr3verse)
library(mlr3spatiotempcv)

task <- tsk("ecuador")

# This example uses the ranger package to fit the model and perform feature filtering
# In order to do this, pipeops need to be used
lrn <- lrn(
  "classif.ranger", 
  num.threads = parallel::detectCores(),
  importance = "impurity",
  predict_type = "prob"
)
po_lrn <- po("learner", lrn)

# Create feature filter based on variable importance
po_filter <- po("filter", filter = mlr3filters::flt("importance", learner = lrn))

# Create process (new learner) for filtering the task
glrn <- GraphLearner$new(po_filter %>>% po_lrn)
glrn$predict_type <- "prob"

# Create filter parameters 
param_set <- ParamSet$new(
  params = list(ParamDbl$new("importance.filter.frac", lower = 0.1, upper = 1))
)

# Create filtering instance
instance <- TuningInstanceSingleCrit$new(
  task = task,
  learner = glrn, 
  resampling =  rsmp("repeated_spcv_coords", folds = 10, repeats = 5), 
  measure = msr("classif.ce"),
  search_space = param_set, 
  terminator = trm("none")
)

# Create tuner
tuner <- tnr("grid_search", resolution = 10)
tuner$optimize(instance)

I am a novice when it comes to using mlr3 and pipelines, so something in my code might be problematic but as far as I can see, the pipeline is correct. I think the issue comes with the tuning aspect of this though - when a filter fraction is defined in glrn, the code executes the spatial cross validation properly:

glrn$param_set$values$importance.filter.frac = 0.3
rr <- resample(task, glrn, rsmp("repeated_spcv_coords", folds = 10, repeats = 5))

So I believe the issue to either be here or in the mlr3tuning package, not sure which so please redirect this issue if necessary. Thanks!

Mention more/older refs for the underlying concepts of all methods

Support resampling method based on predefined spatiotemporal groups

Just as CAST::CreateSpacetimeFolds() does.

I am not sure if this approach can work with all currently implemented spatial sampling methods.
Even if not, we should support exactly this way of creating resamplings since some people already asked me exactly for this.
@HannaMeyer Is there a dedicated name for your method? If not, do you want to make a proposal? :)
You can have a look at the current names of the other methods in the README.

It seems that @jannes-m has added temporal extension support for spcv-coords already.
Let's have a look how this works in detail.

Add tests for `ResamplingRepeatedSptCVCstf`

Store spcv-buffer more efficiently

as it is basically LOOCV
see #20

Instantiate spcv_coords for AutoTuner

Dear mlr3 team,

first of all, thanks for your efforts in developing this extension package, it is very much appreciated.

I am trying to apply spatial CV using "spcv_coords" to an AutoTuner in order to retrieve nested resampling following the process described in the mlr3 book

        RT.at_sp <- AutoTuner$new(
          learner = reg.tree,
          resampling = spatial_CV, 
          measure = opt.mse,
          search_space = param_set_RT,
          terminator = trm.evals,
          tuner = tnr.GridSearch)

However, I end up with the error message:

        "Error: Resampling 'spcv_coords' may not be instantiated".

The same error message remains, even if I try to instantiate the task manually beforehand using the command

        spatial_CV$instantiate(sp_task)

as described in 2.5.2.

As I am not an expert: am I doing something wrong, or is spatial CV not yet implemented for use with AutoTuner?

Thank you very much!
BR, Jürgen

Optimize `mlr_reflections` behavior for package

Should TaskClassifST and friends have entries in mlr_reflections$task_types?

Consider switching to {patchwork} for gridded plots

Might work better with spacing between plots and label alignments?

Replace orphaned GSIF package

It contains the cookfarm example dataset.

Is this dataset available elsewhere?

https://github.com/envirometrix/landmap was mentioned as a replacement but it is not yet on CRAN.

Write usage section in mlr3book

Use {cli} for all message and stop calls

Use mlr3 sugar for creating resampling instances

rsmp() instead of mlr_resamplings$get()

misleading description of coordinates type in TaskRegrST

Dear spatio-temporal guys,
first of all, thanks for providing spatial cv via mlr3!!! I am playing around a bit with mlr3spatiotempcv. In the help file of TaskRegrST(), it says that coordinates should be a data.frame when in fact you need to provide a character vector with the column names of the coordinates found in the backend. A spatio-temporal example would be rather helpful indeed (as already pointed out in #22 and #24). If I can be of any help with the example, please let me know.

Inspect "'k' is bigger than the number of the blocks" error

library(mlr3spatiotemporal)

library(mlr3)
task <- tsk("ecuador")

# Instantiate Resampling
rcv <- rsmp("spcv-block")
rcv$param_set$values <- list(folds = 20)
rcv$instantiate(task)
#> Error in blockCV::spatialBlock(speciesData = points, theRange = self$param_set$values$range, : 'k' is bigger than the number of the blocks

Created on 2019-09-03 by the reprex package (v0.3.0)

Checkerboard pattern with spcv_block?

Dear mlr3spatiotempcv team,

First, many thanks for your hard work on this excellent resource.

I am having an issue producing a checkerboard sampling pattern using spcv_block. Instead of getting a checkerboard spatial partitioning, I always get something that looks more like a random sampling pattern. I have been successful creating a checkerboard pattern using the blockCV functions directly.

Here is a reproducible example that fails to produce a checkerboard sampling pattern:

library(blockCV)
library(mlr3)
library(mlr3spatiotempcv)

x <- runif(5000, -80.5, -75)
y <- runif(5000, 39.7, 42)

data <- data.frame(spp="test", 
                   label=factor(round(runif(length(x), 0, 1))),
                   x=x,
                   y=y)

testTask <- TaskClassifST$new(id = "test", 
                              backend = data, 
                              target = "label",
                              positive="1",
                              extra_args = list(coordinate_names=c("x", "y"),
                                                crs="EPSG: 4326"))

blockSamp <- rsmp("spcv_block",
                  folds=2,
                  range=50000,
                  selection="checkerboard")
blockSamp$instantiate(testTask)
autoplot(blockSamp, testTask)

Dataset: `cookfarm` for spatiotemporal example task

GSIF::cookfarm -> Regr Task #22

Though we do not need all 30k obs, just a subset.

Support "spatialAutoRange" option in "Spatial Block CV"

Maybe via a flag in spcv-block.

https://github.com/rvalavi/blockCV#bsic-usage

Temporal CV

I currently have a task with a column that is a date.
As the task is basically to predict values in the future, a cross-validation strategy that can take this into account would be required, similar to RollingWindowCV.
As this is a very common use-case, we should perhaps think about implementing this.

This is implemented in mlr3forecasting, but for forecasting tasks instead of regular Classif|Regr Tasks.
Where should such a method live? mlr3spatiotempcv ?
How would we go about implementing this?
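
As a rough base R sketch of what a rolling-window scheme on a date column could look like: the data, `window_size` and `horizon` are assumptions for illustration, and this mirrors neither mlr3forecasting nor any existing method in this package.

```r
# Hypothetical daily series; `date` plays the role of the task's date column
dat <- data.frame(
  date = seq(as.Date("2020-01-01"), by = "day", length.out = 30),
  y    = seq_len(30)
)
dat <- dat[order(dat$date), ]

window_size <- 10  # assumed number of training observations per fold
horizon     <- 5   # assumed number of test observations per fold

# Start position of each training window, stepping forward by `horizon`
starts <- seq(1, nrow(dat) - window_size - horizon + 1, by = horizon)

# Each fold trains on a fixed-size window and tests on the rows right after it
folds <- lapply(starts, function(s) {
  train <- s:(s + window_size - 1)
  test  <- (s + window_size):(s + window_size + horizon - 1)
  list(train = train, test = test)
})
```

Every test set lies strictly after its training window, so the model is never evaluated on data older than what it was trained on.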

Remove stratification param from all rsmp methods

Rather than adding it and then stopping with a message stating that it is not supported 🙄

Cstf autoplot plots also omitted points

Whereas it should only plot the train and test points of the respective plot.

Omitted points should be plotted optionally using a different color.