mlr-org / mlr3spatiotempcv

Spatiotemporal resampling methods for mlr3

Home Page: https://mlr3spatiotempcv.mlr-org.com

License: GNU Lesser General Public License v3.0

R 29.57% TeX 70.43%
mlr3 resampling spatial cross-validation temporal resampling-methods r r-package

mlr3spatiotempcv's Introduction

mlr3spatiotempcv

Package website: release | dev

Spatiotemporal resampling methods for mlr3.

This package extends the mlr3 package framework with spatiotemporal resampling and visualization methods.

If you prefer the tidymodels ecosystem, have a look at the {spatialsample} package for spatial sampling methods.

Installation

CRAN version

install.packages("mlr3spatiotempcv")

Development version

remotes::install_github("mlr-org/mlr3spatiotempcv")

# R Universe Repo
install.packages('mlr3spatiotempcv', repos = c(mlrorg = 'https://mlr-org.r-universe.dev', CRAN = 'https://cloud.r-project.org'))

Get Started

See the "Get Started" vignette for a quick introduction.

For more detailed information, including a usage example, see the "Spatiotemporal Analysis" chapter in the mlr3book.

The article "Spatiotemporal Visualization" shows how 3D subplot grids can be created.

Citation

To cite the package in publications, use the output of citation("mlr3spatiotempcv").

Resources

Other spatiotemporal resampling packages

This list does not claim to be comprehensive.

(Disclaimer: Because CRAN does not like DOI URLs in their automated checks, direct linking to scientific articles is not possible...)

Name             Language   Resources
blockCV          R          CRAN
CAST             R          Paper, CRAN
ENMeval          R          CRAN
spatialsample    R          CRAN
sperrorest       R          CRAN
Pyspatialml      Python     GitHub
spacv            Python     GitHub
Museo Toolbox    Python     Paper, GitHub
spatial-kfold    Python     GitHub

FAQ

Which resampling method should I use?
There is no single best resampling method. The choice depends on the characteristics of your dataset and on what the model is meant to predict. The resampling scheme should reflect the final purpose of the model; this concept is called "target-oriented" resampling. For example, if the model was trained on multiple forest plots and its purpose is to predict something on unknown forest stands, the resampling structure should reflect this.
Are there more resampling methods than the ones {mlr3spatiotempcv} offers?
{mlr3spatiotempcv} aims to offer all resampling methods that exist in R, though this does not mean that it covers all resampling methods. If you are missing one, feel free to open an issue.
How can I use the "blocking" concept of the old {mlr}?
This concept is now supported via the "column roles" concept available in {mlr3} [Task](https://mlr3.mlr-org.com/reference/Task.html) objects. See [this documentation](https://mlr3.mlr-org.com/reference/Resampling.html#grouping-blocking) for more information.
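As a minimal sketch of the "column roles" approach, the snippet below assigns the `group` role to a column so that observations sharing a value are kept together in the same fold. The data, the column name `plot_id`, and the task id are hypothetical and chosen for illustration only; the example assumes a reasonably recent {mlr3} that provides `as_task_regr()`.

```r
library(mlr3)

set.seed(1)
# Hypothetical data: 5 forest plots with 4 observations each
dat = data.frame(
  y = rnorm(20),
  x1 = rnorm(20),
  plot_id = factor(rep(1:5, each = 4))
)

task = as_task_regr(dat, target = "y", id = "plots")

# Give plot_id the "group" role: rows with the same plot_id stay together
task$set_col_roles("plot_id", roles = "group")

cv = rsmp("cv", folds = 5)
cv$instantiate(task)

# Plots that end up in the test set of each fold
test_plots = lapply(1:5, function(i) as.character(unique(dat$plot_id[cv$test_set(i)])))
print(test_plots)
```

If the grouping works as intended, no plot should appear in the test set of more than one fold.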
For the methods that offer buffering, how can an appropriate value be chosen?
There is no easy answer to this question. Buffering train and test sets reduces the similarity between the two. The degree of this reduction depends on the dataset itself, and there is no general approach for choosing an appropriate buffer size. Some studies used the distance at which the autocorrelation levels off. This buffer distance often removes quite a lot of observations and needs to be calculated first.
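To illustrate what buffering does and how many observations it can remove, here is a base-R sketch that drops every training point lying within a given distance of any test point. It is not the package's implementation; the coordinates, the test fold, and the buffer distance of 100 are all assumptions made for the example.

```r
set.seed(42)
# Hypothetical point coordinates in a projected CRS (units: metres)
coords = cbind(x = runif(200, 0, 1000), y = runif(200, 0, 1000))

# Hypothetical test fold: all points in the lower-left quadrant
test_idx = which(coords[, "x"] < 500 & coords[, "y"] < 500)
train_idx = setdiff(seq_len(nrow(coords)), test_idx)

buffer = 100  # assumed buffer distance, same units as the coordinates

# Pairwise Euclidean distances between all points
d = as.matrix(dist(coords))

# Distance from each training point to its nearest test point
nearest_test = apply(d[train_idx, test_idx, drop = FALSE], 1, min)

# Keep only training points farther away than the buffer
train_buffered = train_idx[nearest_test > buffer]

cat("training points before buffering:", length(train_idx), "\n")
cat("training points after buffering: ", length(train_buffered), "\n")
```

Rerunning the sketch with different values of `buffer` shows the trade-off directly: larger buffers separate the sets more strongly but leave fewer observations for training.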

mlr3spatiotempcv's People

Contributors

alexanderbrenning, be-marc, github-actions[bot], jannes-m, jawond, lorenzwalthert, mb706, mllg, pat-s, pre-commit-ci[bot]

mlr3spatiotempcv's Issues

Support for sf package

I was wondering if there was any planned support for sf dataframes to be used as task inputs? It may remove some of the arguments a user needs to provide when creating a task. A few useful functions that can be used to define some of the spatial inputs include:

# First, load some point data and sf library
data(meuse, package = "sp") # load data.frame from sp
library(sf)
x <- sf::st_as_sf(meuse, coords = c("x", "y"), crs = 28992)

# Generate the coordinate columns
sf::st_coordinates(x) 

# Find the name of the geometry column
attr(x, "sf_column")

# Extract CRS information of the sf dataframe
sf::st_crs(x)
sf::st_crs(x)$epsg # Gets numeric EPSG code
sf::st_crs(x)$wkt # Gets WKT string

# Ensure that the geometry type is point
sf::st_geometry_type(x, by_geometry = FALSE) # or
all(sf::st_is(x, "POINT"))

# Remove geometry list column for data backend
x_df <- sf::st_drop_geometry(x)

# Keep coordinates as features in the data
cbind(x_df, sf::st_coordinates(x))

No worries if there is no planned support, I'm just curious and offering some solutions just in case! I use the sf package quite a bit in my line of work for extracting raster covariate data, and have used these functions repeatedly. It's just a suggestion for more ease on the user end - rather than providing a dataframe with the x and y coordinates and having to specify the CRS and coordinate names from the dataframe, a user could simply provide the sf dataframe and specify whether to use coordinates as features.

Thanks for all the hard work on this package! I use it frequently and it works really well!

Coordinates in the mlr3 Task object

Hi, I am trying to create a task from my own data, as shown in some tutorials for mlr. I generated some random data and have a data.frame with coordinates. When I call TaskRegr$new( .... , coordinates = coords), this results in:

Error in .subset2(public_bind_env, "initialize")(...) :
unused argument (coordinates = coords)

Adding coordinates manually to the task does not work either. It would be great to have a full example here, starting from raw data, of a regression problem where one wants to predict y from X and has lat/lon coordinates to consider in the CV.
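For orientation, the general idea of taking coordinates into account during CV can be sketched in base R without any task object: cluster the observations on their coordinates and use each cluster as one fold. This is only an illustration of the concept, not the package's API or implementation; the data, the column names `lon`, `lat`, `X`, `y`, and the choice of 5 folds are hypothetical. Clustering raw lon/lat values with Euclidean distance is itself an approximation.

```r
set.seed(123)
# Hypothetical regression data with lon/lat columns
n = 300
dat = data.frame(
  lon = runif(n, 10, 12),
  lat = runif(n, 49, 51),
  X = rnorm(n)
)
dat$y = 2 * dat$X + rnorm(n)

k = 5  # assumed number of spatial folds

# Cluster the observations on their coordinates; each cluster is one fold
fold = kmeans(dat[, c("lon", "lat")], centers = k, nstart = 10)$cluster

# Fit on k - 1 spatial clusters, predict the held-out cluster
rmse = sapply(seq_len(k), function(i) {
  fit = lm(y ~ X, data = dat[fold != i, ])
  pred = predict(fit, newdata = dat[fold == i, ])
  sqrt(mean((dat$y[fold == i] - pred)^2))
})

print(round(rmse, 3))
```

Note that the coordinates are used only to build the folds here; they are not passed to the model as features.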

Add 2D plot method for Cstf method

If only space_var is used, a 2D ggplot can be created similar to other spatial-only methods.

This method should also show the omitted points optionally.

Spatial CV failed with mlr3tuning

I want to evaluate the performance of hyperparameters on a spatial dataset within spatial CV. Unfortunately, while non-spatial CV or bootstrap work in the mlr³tuning instance, spcv-coords and repeated-spcv-coords produce the error:

Error in benchmark_grid(self$task, learners, self$resampling) (mlr3_issue.R#43): Resampling is instantiated for a task with a different number of observations

I'm not sure whether the error lies in my code (or in the idea of spatially cross-validating hyperparameters) or whether the feature is not provided by mlr3spatiotempcv. However, even if I instantiate the resampling method on the task (tuning_resampling$instantiate(task)), the tuning instance reproduces the error.
A reproducible example from the ecuador dataset:

library("mlr3")
library("mlr3spatiotempcv")
library("mlr3tuning")
library("paradox")

task = tsk("ecuador")
learner = lrn("classif.rpart", predict_type = "prob")

# tune hyperparameter cp
param_set = ps(cp = p_dbl(lower = -5, upper = 0, trafo = function(x) 10^x))

# AUROC suitable for binary classification tasks
measure = msr("classif.auc")

# 10 evaluations
terminator = trm("evals", n_evals = 10)

# random search: best balance between computation time and search space grazing
tuner = tnr("random_search")

# inner resampling method
tuning_resampling = rsmp("spcv-coords", folds = 10)
# tuning_resampling$instantiate(task)

instance = TuningInstanceSingleCrit$new(
    task = task,
    learner = learner,
    resampling = tuning_resampling,
    measure = measure,
    search_space = param_set,
    terminator = terminator
)

tuner$optimize(instance)

#' Error in benchmark_grid(self$task, learners, self$resampling) (mlr3_issue.R#43):
#' Resampling is instantiated for a task with a different number of observations

make spatiotempcv compatible with other packages

  • Currently mlr3spatiotempcv overwrites col_roles and row_roles in zzz.R.
    This makes it non-compatible with other packages, i.e. if I load a package that adds a different set of col_roles, those are overwritten by mlr3spatiotempcv. We should append instead of overwriting here.

  • The fact that the task_type is non-unique after loading mlr3spatiotempcv leads to tiny problems in mlr3pipelines. We should discuss at a higher level how we expect packages in the mlr3verse to behave here, as it is not completely clear how things should work here.
    We will find a work-around in mlr3pipelines for now.

Resampling: Add spatial methods

Prerequisite for spatial stuff: We need a place to store the coordinates in the task. This needs to be enabled in mlr3.

As mentioned here my vision is to make mlr3 THE place for spatial/spatio-temporal resampling methods (there are > 5 methods).

TaskClassifST fails with ordered factor

Hi Patrick,

is there any reason why you restrict the class of the target column to either factor or character?
I have an ordered factor as response, which fails in task creation with TaskClassifST$new(),
but it works with TaskClassif$new()

if (info$type %nin% c("factor", "character")) {

See also this issue, where the question was raised for mlr3:
mlr-org/mlr3#95

Here's a little reprex:

library(mlr3verse)
#> Loading required package: mlr3
#> Loading required package: mlr3filters
#> Loading required package: mlr3learners
#> Loading required package: mlr3pipelines
#> Loading required package: mlr3tuning
#> Loading required package: mlr3viz
#> Loading required package: paradox

# remotes::install_github("mlr-org/mlr3spatiotempcv")
library(mlr3spatiotempcv)

brew <- mapview::breweries
brew$number.seasonal.beers <- factor(brew$number.seasonal.beers, ordered = TRUE)

brew <- cbind(sf::st_drop_geometry(brew), 
              sf::st_coordinates(brew))

# task 
task_nsb_o = TaskClassif$new(
  id = "nsb",
  backend = brew, 
  target = "number.seasonal.beers")

task_nsb_o
#> <TaskClassif:nsb> (224 x 10)
#> * Target: number.seasonal.beers
#> * Properties: multiclass
#> * Features (9):
#>   - dbl (4): X, Y, founded, number.of.types
#>   - chr (4): address, brewery, village, zipcode
#>   - fct (1): state

# task ST - ordered
task_nsb_ST_o = TaskClassifST$new(
  id = "nsbST_o",
  backend = brew, 
  target = "number.seasonal.beers", 
  extra_args = list(
    coordinate_names = c("X", "Y"),
    coords_as_features = FALSE,
    crs = "4326"))
#> Error: Target column 'number.seasonal.beers' must be a factor or character

# task ST - factor
brew$number.seasonal.beers <- factor(brew$number.seasonal.beers, ordered = FALSE)
task_nsb_ST_f = TaskClassifST$new(
  id = "nsbST_f",
  backend = brew, 
  target = "number.seasonal.beers", 
  extra_args = list(
    coordinate_names = c("X", "Y"),
    coords_as_features = FALSE,
    crs = "4326"))

task_nsb_ST_f
#> <TaskClassifST:nsbST_f> (224 x 8)
#> * Target: number.seasonal.beers
#> * Properties: multiclass
#> * Features (7):
#>   - chr (4): address, brewery, village, zipcode
#>   - dbl (2): founded, number.of.types
#>   - fct (1): state
#> * Coordinates:
#>             X        Y
#>   1: 10.88922 49.71979
#>   2: 11.23873 50.12579
#>   3: 10.85194 49.42080
#>   4: 10.07837 50.16197
#>   5:  9.97323 49.97720
#>  ---                  
#> 220: 10.93073 50.12684
#> 221: 11.54562 50.07220
#> 222: 11.50372 50.01548
#> 223: 11.55831 49.98518
#> 224: 11.07389 50.06172

Created on 2020-11-09 by the reprex package (v0.3.0)

More descriptive / informative error messages for wrong task

Currently, the error messages shown when using a wrong task with a spatial resampling method do not really help with finding the error. They could be more descriptive.

library(mlr3); library(mlr3spatiotempcv);rsmp("spcv_coords")$instantiate(tsk("boston_housing"))
#> Error in as.matrix(x): attempt to apply non-function

Created on 2020-09-14 by the reprex package (v0.3.0)

Move arguments of `instantiate()` to constructor?

The problem is that the user does not call instantiate() actively in a Graph Learner. Hence, required arguments like time_var for SptCV methods cannot be passed along.

They need to be populated from a field which is set during construction.
If we do so, we possibly can omit the need of arguments in instantiate() completely?

Model tuning does not work with mlr3spatiotempcv package

I've been at this issue for a while now and I figured I should report this and give some examples. In previous versions of mlr3 and associated packages, I was able to perform the following task:

  1. Perform feature filtering on a dataset using variable importance filters (i.e.: tuning)
  2. Construct a repeated spatial cross validation model using the filtered dataset
  3. Select the best filtered model

I was attempting to carry that out again this week but I've hit quite the roadblock - it appears that tuning no longer plays nicely with the mlr3spatiotempcv package! Here is a reproducible example:

library(mlr3verse)
library(mlr3spatiotempcv)

task <- tsk("ecuador")

# This example uses the ranger package to fit the model and perform feature filtering
# In order to do this, pipeops need to be used
lrn <- lrn(
  "classif.ranger", 
  num.threads = parallel::detectCores(),
  importance = "impurity",
  predict_type = "prob"
)
po_lrn <- po("learner", lrn)

# Create feature filter based on variable importance
po_filter <- po("filter", filter = mlr3filters::flt("importance", learner = lrn))

# Create process (new learner) for filtering the task
glrn <- GraphLearner$new(po_filter %>>% po_lrn)
glrn$predict_type <- "prob"

# Create filter parameters 
param_set <- ParamSet$new(
  params = list(ParamDbl$new("importance.filter.frac", lower = 0.1, upper = 1))
)

# Create filtering instance
instance <- TuningInstanceSingleCrit$new(
  task = task,
  learner = glrn, 
  resampling =  rsmp("repeated_spcv_coords", folds = 10, repeats = 5), 
  measure = msr("classif.ce"),
  search_space = param_set, 
  terminator = trm("none")
)

# Create tuner
tuner <- tnr("grid_search", resolution = 10)
tuner$optimize(instance)

I am a novice when it comes to using mlr3 and pipelines, so something in my code might be problematic but as far as I can see, the pipeline is correct. I think the issue comes with the tuning aspect of this though - when a filter fraction is defined in glrn, the code executes the spatial cross validation properly:

glrn$param_set$values$importance.filter.frac = 0.3
rr <- resample(task, glrn, rsmp("repeated_spcv_coords", folds = 10, repeats = 5))

So I believe the issue to either be here or in the mlr3tuning package, not sure which so please redirect this issue if necessary. Thanks!

Support resampling method based on predefined spatiotemporal groups

Just as CAST::CreateSpacetimeFolds() does.

I am not sure if this approach can work with all currently implemented spatial sampling methods.
Even if not, we should support exactly this way of creating resamplings since some people already asked me exactly for this.
@HannaMeyer Is there a dedicated name for your method? If not, do you want to make a proposal? :)
You can have a look at the current names of the other methods in the README.

It seems that @jannes-m has added temporal extension support for spcv-coords already.
Let's have a look how this works in detail.

Instantiate spcv_coords for AutoTuner

Dear mlr3 team,

first of all, thanks for your efforts in developing this extension package, it is very much appreciated.

I am trying to apply spatial CV using "spcv_coords" to an AutoTuner in order to perform nested resampling, following the process described in the mlr3 book:

        RT.at_sp <- AutoTuner$new(
          learner = reg.tree,
          resampling = spatial_CV, 
          measure = opt.mse,
          search_space = param_set_RT,
          terminator = trm.evals,
          tuner = tnr.GridSearch)

However, I end up with the error message:

        "Error: Resampling 'spcv_coords' may not be instantiated". 

The same error message remains, even if I try to instantiate the task manually beforehand using the command

        spatial_CV$instantiate(sp_task)

as described in 2.5.2.

As I am not an expert, am I doing something wrong, or is spatial CV not yet implemented for use with AutoTuner?

Thank you very much!
BR, Jürgen

misleading description of coordinates type in TaskRegrST

Dear spatio-temporal guys,
first of all, thanks for providing spatial cv via mlr3!!! I am playing around a bit with mlr3spatiotempcv. In the help file of TaskRegrST(), it says that coordinates should be a data.frame when in fact you need to provide a character vector with the column names of the coordinates found in the backend. A spatio-temporal example would be rather helpful indeed (as already pointed out in #22 and #24). If I can be of any help re the example, pls let me know.

Inspect "'k' is bigger than the number of the blocks" error

library(mlr3spatiotemporal)

library(mlr3)
task <- tsk("ecuador")

# Instantiate Resampling
rcv <- rsmp("spcv-block")
rcv$param_set$values <- list(folds = 20)
rcv$instantiate(task)
#> Error in blockCV::spatialBlock(speciesData = points, theRange = self$param_set$values$range, : 'k' is bigger than the number of the blocks

Created on 2019-09-03 by the reprex package (v0.3.0)

Checkerboard pattern with spcv_block?

Dear mlr3spatiotempcv team,

First, many thanks for your hard work on this excellent resource.

I am having an issue producing a checkerboard sampling pattern using spcv_block. Instead of getting a checkerboard spatial partitioning, I always get something that looks more like a random sampling pattern. I have been successful creating a checkerboard pattern using the blockCV functions directly.

Here is a reproducible example that fails to produce a checkerboard sampling pattern:

library(blockCV)
library(mlr3)
library(mlr3spatiotempcv)

x <- runif(5000, -80.5, -75)
y <- runif(5000, 39.7, 42)

data <- data.frame(spp="test", 
                   label=factor(round(runif(length(x), 0, 1))),
                   x=x,
                   y=y)

testTask <- TaskClassifST$new(id = "test", 
                              backend = data, 
                              target = "label",
                              positive="1",
                              extra_args = list(coordinate_names=c("x", "y"),
                                                crs="EPSG: 4326"))

blockSamp <- rsmp("spcv_block",
                  folds=2,
                  range=50000,
                  selection="checkerboard")
blockSamp$instantiate(testTask)
autoplot(blockSamp, testTask)

Temporal CV

I currently have a task with a column that is a date.
As the task is basically to predict values in the future, a cross-validation strategy that takes this into account is required, similar to RollingWindowCV.
As this is a very common use-case, we should perhaps think about implementing this.

  • This is implemented in mlr3forecasting, but for forecasting tasks instead of regular Classif|Regr Tasks.
  • Where should such a method live? mlr3spatiotempcv ?
  • How would we go about implementing this?
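To make the request concrete, a rolling-window scheme on a date column could look roughly like the base-R sketch below: each split trains on a fixed-size window of past rows and tests on the rows that immediately follow. This is only an illustration of the idea, not an existing mlr3 resampling; the data, the `window` size of 60, and the `horizon` of 10 are assumptions.

```r
set.seed(1)
# Hypothetical daily data with a date column
dat = data.frame(
  date = seq(as.Date("2020-01-01"), by = "day", length.out = 100),
  x = rnorm(100)
)
dat$y = 0.5 * dat$x + rnorm(100)

# Rolling windows only make sense on time-ordered rows
dat = dat[order(dat$date), ]

window = 60   # assumed number of training rows per split
horizon = 10  # assumed number of test rows per split

# First training row of each split; the window moves forward by one horizon
starts = seq(1, nrow(dat) - window - horizon + 1, by = horizon)

splits = lapply(starts, function(s) {
  list(
    train = s:(s + window - 1),
    test = (s + window):(s + window + horizon - 1)
  )
})

# Show the date ranges covered by each split
for (s in splits) {
  cat(
    "train:", format(range(dat$date[s$train])),
    "| test:", format(range(dat$date[s$test])), "\n"
  )
}
```

In every split, all training dates precede all test dates, which is the property a random CV cannot guarantee.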
