mlr-org / mlr3spatiotempcv

Spatiotemporal resampling methods for mlr3

Home Page: https://mlr3spatiotempcv.mlr-org.com

License: GNU Lesser General Public License v3.0

R 29.57% TeX 70.43%
mlr3 resampling spatial cross-validation temporal resampling-methods r r-package

mlr3spatiotempcv's Issues

Model tuning does not work with mlr3spatiotempcv package

I've been at this issue for a while now and I figured I should report this and give some examples. In previous versions of mlr3 and associated packages, I was able to perform the following task:

  1. Perform feature filtering on a dataset using variable importance filters (i.e.: tuning)
  2. Construct a repeated spatial cross validation model using the filtered dataset
  3. Select the best filtered model

I was attempting to carry that out again this week but I've hit quite the roadblock - it appears that tuning no longer plays nicely with the mlr3spatiotempcv package! Here is a reproducible example:

library(mlr3verse)
library(mlr3spatiotempcv)

task <- tsk("ecuador")

# This example uses the ranger package to fit the model and perform feature filtering
# In order to do this, pipeops need to be used
lrn <- lrn(
  "classif.ranger", 
  num.threads = parallel::detectCores(),
  importance = "impurity",
  predict_type = "prob"
)
po_lrn <- po("learner", lrn)

# Create feature filter based on variable importance
po_filter <- po("filter", filter = mlr3filters::flt("importance", learner = lrn))

# Create process (new learner) for filtering the task
glrn <- GraphLearner$new(po_filter %>>% po_lrn)
glrn$predict_type <- "prob"

# Create filter parameters 
param_set <- ParamSet$new(
  params = list(ParamDbl$new("importance.filter.frac", lower = 0.1, upper = 1))
)

# Create filtering instance
instance <- TuningInstanceSingleCrit$new(
  task = task,
  learner = glrn, 
  resampling =  rsmp("repeated_spcv_coords", folds = 10, repeats = 5), 
  measure = msr("classif.ce"),
  search_space = param_set, 
  terminator = trm("none")
)

# Create tuner
tuner <- tnr("grid_search", resolution = 10)
tuner$optimize(instance)

I am a novice when it comes to using mlr3 and pipelines, so something in my code might be problematic but as far as I can see, the pipeline is correct. I think the issue comes with the tuning aspect of this though - when a filter fraction is defined in glrn, the code executes the spatial cross validation properly:

glrn$param_set$values$importance.filter.frac = 0.3
rr <- resample(task, glrn, rsmp("repeated_spcv_coords", folds = 10, repeats = 5))

So I believe the issue to either be here or in the mlr3tuning package, not sure which so please redirect this issue if necessary. Thanks!

Support resampling method based on predefined spatiotemporal groups

Just as CAST::CreateSpacetimeFolds() does.

I am not sure if this approach can work with all currently implemented spatial sampling methods.
Even if not, we should support exactly this way of creating resamplings since some people already asked me exactly for this.
@HannaMeyer Is there a dedicated name for your method? If not, do you want to make a proposal? :)
You can have a look at the current names of the other methods in the README.

It seems that @jannes-m has added temporal extension support for spcv-coords already.
Let's have a look how this works in detail.
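
To make the request concrete, here is a minimal base R sketch of fold creation from predefined groups. It is not the CAST or mlr3spatiotempcv implementation; `make_group_folds()` and the toy `station` vector are hypothetical names used only for illustration. The idea is that every observation of a group (a station, a date, or any other predefined unit) ends up in the same fold.

```r
# Hypothetical helper: assign each predefined group (e.g. a station ID or a
# date) to one of k folds, so that all rows of a group share a fold.
make_group_folds <- function(groups, k, seed = 1) {
  ids <- unique(groups)
  stopifnot(k >= 2, k <= length(ids))
  set.seed(seed)
  # Shuffle the group IDs, then deal them out to the folds in turn
  fold_of_id <- rep_len(seq_len(k), length(ids))
  names(fold_of_id) <- sample(as.character(ids))
  unname(fold_of_id[as.character(groups)])
}

# Toy data: 12 observations from 4 stations
station <- rep(c("A", "B", "C", "D"), each = 3)
fold <- make_group_folds(station, k = 2)

# Each test set holds out whole stations, never a part of one
test_sets <- split(seq_along(station), fold)
```

Whether the same grouping logic can be combined with the existing spatial methods is a separate question; the sketch only covers the group-to-fold assignment.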

Add 2D plot method for Cstf method

If only space_var is used, a 2D ggplot can be created similar to other spatial-only methods.

This method should also show the omitted points optionally.

Checkerboard pattern with spcv_block?

Dear mlr3spatiotempcv team,

First, many thanks for your hard work on this excellent resource.

I am having an issue producing a checkerboard sampling pattern using spcv_block. Instead of getting a checkerboard spatial partitioning, I always get something that looks more like a random sampling pattern. I have been successful creating a checkerboard pattern using the blockCV functions directly.

Here is a reproducible example that fails to produce a checkerboard sampling pattern:

library(blockCV)
library(mlr3)
library(mlr3spatiotempcv)

x <- runif(5000, -80.5, -75)
y <- runif(5000, 39.7, 42)

data <- data.frame(spp="test", 
                   label=factor(round(runif(length(x), 0, 1))),
                   x=x,
                   y=y)

testTask <- TaskClassifST$new(id = "test", 
                              backend = data, 
                              target = "label",
                              positive="1",
                              extra_args = list(coordinate_names=c("x", "y"),
                                                crs="EPSG: 4326"))

blockSamp <- rsmp("spcv_block",
                  folds=2,
                  range=50000,
                  selection="checkerboard")
blockSamp$instantiate(testTask)
autoplot(blockSamp, testTask)

[Image: Rplot01]
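
For comparison, this is roughly what a checkerboard assignment is expected to look like. The sketch is plain base R and is not how blockCV or spcv_block compute their blocks; `checkerboard_folds()` is a hypothetical helper, and `size` is assumed to be given in the same units as the coordinates (degrees here).

```r
# Assign points to two folds in a checkerboard pattern over a regular grid.
# `size` is the block edge length in the same units as x and y.
checkerboard_folds <- function(x, y, size) {
  col <- floor((x - min(x)) / size)
  row <- floor((y - min(y)) / size)
  # Neighbouring blocks alternate between fold 1 and fold 2
  (col + row) %% 2 + 1
}

set.seed(42)
x <- runif(5000, -80.5, -75)
y <- runif(5000, 39.7, 42)
fold <- checkerboard_folds(x, y, size = 0.5)  # 0.5 degree blocks
table(fold)

# Uncomment to view the pattern:
# plot(x, y, col = fold, pch = 20, cex = 0.4)
```

Plotting the result should show alternating square blocks, which makes it easy to tell a true checkerboard from a random-looking partitioning.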

Resampling: Add spatial methods

Prerequisite for spatial stuff: We need a place to store the coordinates in the task. This needs to be enabled in mlr3.

As mentioned here my vision is to make mlr3 THE place for spatial/spatio-temporal resampling methods (there are > 5 methods).

misleading description of coordinates type in TaskRegrST

Dear spatio-temporal guys,
first of all, thanks for providing spatial CV via mlr3!!! I am playing around a bit with mlr3spatiotempcv. The help file of TaskRegrST() says that coordinates should be a data.frame when in fact you need to provide a character string indicating the column names of the coordinates found in the backend. A spatio-temporal example would be rather helpful indeed (as already pointed out in #22 and #24). If I can be of any help with the example, please let me know.

Inspect "'k' is bigger than the number of the blocks" error

library(mlr3spatiotemporal)

library(mlr3)
task <- tsk("ecuador")

# Instantiate Resampling
rcv <- rsmp("spcv-block")
rcv$param_set$values <- list(folds = 20)
rcv$instantiate(task)
#> Error in blockCV::spatialBlock(speciesData = points, theRange = self$param_set$values$range, : 'k' is bigger than the number of the blocks

Created on 2019-09-03 by the reprex package (v0.3.0)
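
One way to reason about this error is to compare the requested number of folds with the number of blocks a given block size can produce. The base R sketch below assumes a simple square grid over the bounding box of the points, which is a simplification and not necessarily how blockCV lays out its blocks; `count_blocks()` and `check_folds()` are hypothetical helpers.

```r
# Number of square blocks of edge length `block_size` needed to cover the
# bounding box of the points (a simplified grid layout, assumed here).
count_blocks <- function(x, y, block_size) {
  n_cols <- max(1, ceiling(diff(range(x)) / block_size))
  n_rows <- max(1, ceiling(diff(range(y)) / block_size))
  n_cols * n_rows
}

# Fail early with a message that names both numbers
check_folds <- function(x, y, block_size, folds) {
  n_blocks <- count_blocks(x, y, block_size)
  if (folds > n_blocks) {
    stop(sprintf(
      "folds = %d exceeds the %d blocks produced by block size %s; reduce 'folds' or the block size",
      folds, n_blocks, format(block_size)
    ))
  }
  invisible(n_blocks)
}

x <- c(0, 1000, 2500, 4000)
y <- c(0, 500, 1500, 3000)
count_blocks(x, y, block_size = 1000)  # 4 columns times 3 rows
```

With these toy coordinates, asking for 20 folds would fail the check, while 10 folds would pass.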

Support for sf package

I was wondering if there was any planned support for sf dataframes to be used as task inputs? It may remove some of the arguments a user needs to provide when creating a task. A few useful functions that can be used to define some of the spatial inputs include:

# First, load some point data and sf library
data(meuse, package = "sp") # load data.frame from sp
library(sf)
x <- sf::st_as_sf(meuse, coords = c("x", "y"), crs = 28992)

# Generate the coordinate columns
sf::st_coordinates(x) 

# Find the names of the coordinate columns
attr(x, "sf_column")

# Extract CRS information of the sf dataframe
sf::st_crs(x)
sf::st_crs(x)$epsg # Gets numeric EPSG code
sf::st_crs(x)$wkt # Gets WKT string

# Ensure that the geometry type is point
sf::st_geometry_type(x, by_geometry = FALSE) # or
all(sf::st_is(x, "POINT"))

# Remove geometry list column for data backend
x_df <- sf::st_drop_geometry(x)

# Keep coordinates as features in the data
cbind(x_df, sf::st_coordinates(x))

No worries if there is no planned support, I'm just curious and offering some solutions just in case! I use the sf package quite a bit in my line of work for extracting raster covariate data, and have used these functions repeatedly. It's just a suggestion for more ease on the user end - rather than providing a dataframe with the x and y coordinates and having to specify the CRS and coordinate names from the dataframe, a user could simply provide the sf dataframe and specify whether to use coordinates as features.

Thanks for all the hard work on this package! I use it frequently and it works really well!

Temporal CV

I currently have a task with a column that is a date.
As the task is basically to predict values in the future, a cross-validation strategy that can take this into account would be required, similar to RollingWindowCV.
As this is a very common use-case, we should perhaps think about implementing this.

  • This is implemented in mlr3forecasting, but for forecasting tasks instead of regular Classif|Regr Tasks.
  • Where should such a method live? mlr3spatiotempcv ?
  • How would we go about implementing this?
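
As a starting point for the discussion, the splitting logic itself is small. The following base R sketch is not the mlr3forecasting implementation; `rolling_window_folds()` and its arguments (`initial`, `horizon`, `step`, `fixed_window`) are hypothetical names. Training rows always precede the test rows in time.

```r
# Build train/test index sets so that every test window lies strictly
# after its training window.
rolling_window_folds <- function(dates, initial, horizon, step = horizon,
                                 fixed_window = FALSE) {
  n <- length(dates)
  stopifnot(initial >= 1, horizon >= 1, initial + horizon <= n)
  ord <- order(dates)  # row indices sorted from oldest to newest
  train_ends <- seq(initial, n - horizon, by = step)
  lapply(train_ends, function(end) {
    # Fixed window: keep `initial` rows; otherwise grow from the first row
    start <- if (fixed_window) end - initial + 1 else 1
    list(train = ord[start:end], test = ord[(end + 1):(end + horizon)])
  })
}

dates <- as.Date("2020-01-01") + 0:9
folds <- rolling_window_folds(dates, initial = 4, horizon = 2)
length(folds)
folds[[1]]
```

The open design questions from the list above (where the method lives, how it fits regular Classif and Regr tasks) are not addressed by this sketch.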

Move arguments of `instantiate()` to constructor?

The problem is that the user does not call instantiate() actively in a Graph Learner. Hence, required arguments like time_var for SptCV methods cannot be passed along.

They need to be populated from a field which is set during construction.
If we do so, we possibly can omit the need of arguments in instantiate() completely?

Coordinates in the mlr3 Task object

Hi, I am trying to generate a task with my own data, as shown in some tutorials for mlr. I generate some random data and have a data.frame with coordinates. When I then call TaskRegr$new( .... , coordinates = coords), it results in:

Error in .subset2(public_bind_env, "initialize")(...) :
unused argument (coordinates = coords)

Adding coordinates manually to the task does not work either. It would be great to have a full example here, starting from the raw data of a regression problem where one wants to predict y from X and has lat/lon coordinates to consider in the CV.

make spatiotempcv compatible with other packages

  • Currently mlr3spatiotempcv overwrites col_roles and row_roles in zzz.R.
    This makes it non-compatible with other packages, i.e. if I load a package that adds a different set of col_roles, those are overwritten by mlr3spatiotempcv. We should append instead of overwriting here.

  • The fact that the task_type is non-unique after loading mlr3spatiotempcv leads to tiny problems in mlr3pipelines. We should discuss at a higher level how we expect packages in the mlr3verse to behave here, as it is not completely clear how things should work here.
    We will find a workaround in mlr3pipelines for now.
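
The difference between overwriting and appending in the first point can be shown with a small stand-in registry. This is a base R sketch only: `registry`, `register_overwrite()`, `register_append()` and the role name `"my_role"` are hypothetical and do not reproduce the actual code in zzz.R.

```r
# Stand-in for a shared registry of column roles per task type
registry <- new.env()
registry$col_roles <- list(regr = c("feature", "target"))

# Overwriting: roles registered earlier by another package are lost
register_overwrite <- function(reg, type, roles) {
  reg$col_roles[[type]] <- roles
}

# Appending: existing roles are kept and duplicates are dropped
register_append <- function(reg, type, roles) {
  reg$col_roles[[type]] <- union(reg$col_roles[[type]], roles)
}

register_append(registry, "regr", "my_role")                    # some other package
register_append(registry, "regr", c("feature", "coordinates"))  # spatial package
registry$col_roles$regr
```

With `register_overwrite()` in the second call, `"my_role"` would disappear from the registry, which is the incompatibility described above.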

More descriptive / informative error messages for wrong task

Currently, the error messages shown when a wrong task is used with a spatial resampling method do not really help with finding the error. They could be more descriptive.

library(mlr3); library(mlr3spatiotempcv);rsmp("spcv_coords")$instantiate(tsk("boston_housing"))
#> Error in as.matrix(x): attempt to apply non-function

Created on 2020-09-14 by the reprex package (v0.3.0)
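
A more helpful message could come from an explicit check before the coordinates are used. The sketch below uses a plain list as a stand-in for a task, so `assert_spatial_task()` and the `coordinates` element are assumptions for illustration and not the package's internals.

```r
# Hypothetical check: `task` is a plain list standing in for an mlr3 task
assert_spatial_task <- function(task) {
  if (is.null(task$coordinates)) {
    stop(sprintf(
      "Task '%s' has no coordinates; spatial resampling needs a spatial task (e.g. TaskClassifST or TaskRegrST).",
      task$id
    ), call. = FALSE)
  }
  invisible(task)
}

plain_task <- list(id = "boston_housing", coordinates = NULL)
msg <- tryCatch(assert_spatial_task(plain_task), error = conditionMessage)
msg
```

The message names the task and states what kind of task is expected, instead of surfacing an internal `as.matrix()` failure.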

Spatial CV failed with mlr3tuning

I want to evaluate the performance of hyperparameters on a spatial dataset within spatial CV. Unfortunately, while non-spatial CV or bootstrap work in the mlr3tuning instance, spcv-coords and repeated-spcv-coords produce the error:

Error in benchmark_grid(self$task, learners, self$resampling) (mlr3_issue.R#43): Resampling is instantiated for a task with a different number of observations

I'm not sure whether the error lies in my code (or in the idea of spatially cross-validating hyperparameters) or whether the feature is not provided by mlr3spatiotempcv. However, even if I instantiate the resampling method on the task (tuning_resampling$instantiate(task)), the tuning instance reproduces the error.
A reproducible example from the ecuador dataset:

library("mlr3")
library("mlr3spatiotempcv")
library("mlr3tuning")
library("paradox")

task = tsk("ecuador")
learner = lrn("classif.rpart", predict_type = "prob")

# tune hyperparameter cp
param_set = ps(cp = p_dbl(lower = -5, upper = 0, trafo = function(x) 10^x))

# AUROC suitable for binary classification tasks
measure = msr("classif.auc")

# 10 evaluations
terminator = trm("evals", n_evals = 10)

# random search: best balance between computation time and search space grazing
tuner = tnr("random_search")

# inner resampling method
tuning_resampling = rsmp("spcv-coords", folds = 10)
# tuning_resampling$instantiate(task)

instance = TuningInstanceSingleCrit$new(
    task = task,
    learner = learner,
    resampling = tuning_resampling,
    measure = measure,
    search_space = param_set,
    terminator = terminator
)

tuner$optimize(instance)

#' Error in benchmark_grid(self$task, learners, self$resampling) (mlr3_issue.R#43):
#' Resampling is instantiated for a task with a different number of observations

Instantiate spcv_coords for AutoTuner

Dear mlr3 team,

first of all, thanks for your efforts in developing this extension package, it is very much appreciated.

I am trying to apply spatial CV using "spcv_coords" to an AutoTuner in order to retrieve nested resampling following the process described in the mlr3 book

        RT.at_sp <- AutoTuner$new(
          learner = reg.tree,
          resampling = spatial_CV, 
          measure = opt.mse,
          search_space = param_set_RT,
          terminator = trm.evals,
          tuner = tnr.GridSearch)

However, I end up with the error message:

        "Error: Resampling 'spcv_coords' may not be instantiated". 

The same error message remains, even if I try to instantiate the task manually beforehand using the command

        spatial_CV$instantiate(sp_task)

as described in 2.5.2.

As I am not an expert: am I doing something wrong, or is spatial CV not yet implemented for use with AutoTuner?

Thank you very much!
BR, Jürgen

TaskClassifST fails with ordered factor

Hi Patrick,

is there any reason why you restrict the class of the target column to either factor or character?
I have an ordered factor as response, which fails in task creation with TaskClassifST$new(),
but it works with TaskClassif$new()

if (info$type %nin% c("factor", "character")) {

See also this issue, where the question was raised for mlr3:
mlr-org/mlr3#95

Here's a little reprex:

library(mlr3verse)
#> Loading required package: mlr3
#> Loading required package: mlr3filters
#> Loading required package: mlr3learners
#> Loading required package: mlr3pipelines
#> Loading required package: mlr3tuning
#> Loading required package: mlr3viz
#> Loading required package: paradox

# remotes::install_github("mlr-org/mlr3spatiotempcv")
library(mlr3spatiotempcv)

brew <- mapview::breweries
brew$number.seasonal.beers <- factor(brew$number.seasonal.beers, ordered = TRUE)

brew <- cbind(sf::st_drop_geometry(brew), 
              sf::st_coordinates(brew))

# task 
task_nsb_o = TaskClassif$new(
  id = "nsb",
  backend = brew, 
  target = "number.seasonal.beers")

task_nsb_o
#> <TaskClassif:nsb> (224 x 10)
#> * Target: number.seasonal.beers
#> * Properties: multiclass
#> * Features (9):
#>   - dbl (4): X, Y, founded, number.of.types
#>   - chr (4): address, brewery, village, zipcode
#>   - fct (1): state

# task ST - ordered
task_nsb_ST_o = TaskClassifST$new(
  id = "nsbST_o",
  backend = brew, 
  target = "number.seasonal.beers", 
  extra_args = list(
    coordinate_names = c("X", "Y"),
    coords_as_features = FALSE,
    crs = "4326"))
#> Error: Target column 'number.seasonal.beers' must be a factor or character

# task ST - factor
brew$number.seasonal.beers <- factor(brew$number.seasonal.beers, ordered = FALSE)
task_nsb_ST_f = TaskClassifST$new(
  id = "nsbST_f",
  backend = brew, 
  target = "number.seasonal.beers", 
  extra_args = list(
    coordinate_names = c("X", "Y"),
    coords_as_features = FALSE,
    crs = "4326"))

task_nsb_ST_f
#> <TaskClassifST:nsbST_f> (224 x 8)
#> * Target: number.seasonal.beers
#> * Properties: multiclass
#> * Features (7):
#>   - chr (4): address, brewery, village, zipcode
#>   - dbl (2): founded, number.of.types
#>   - fct (1): state
#> * Coordinates:
#>             X        Y
#>   1: 10.88922 49.71979
#>   2: 11.23873 50.12579
#>   3: 10.85194 49.42080
#>   4: 10.07837 50.16197
#>   5:  9.97323 49.97720
#>  ---                  
#> 220: 10.93073 50.12684
#> 221: 11.54562 50.07220
#> 222: 11.50372 50.01548
#> 223: 11.55831 49.98518
#> 224: 11.07389 50.06172

Created on 2020-11-09 by the reprex package (v0.3.0)
