mlr-org / mlr3spatiotempcv
Home Page: https://mlr3spatiotempcv.mlr-org.com
License: GNU Lesser General Public License v3.0
Spatiotemporal resampling methods for mlr3
I've been at this issue for a while now and I figured I should report this and give some examples. In previous versions of mlr3 and associated packages, I was able to perform the following task:
I was attempting to carry that out again this week but I've hit quite the roadblock - it appears that tuning no longer plays nicely with the mlr3spatiotempcv package! Here is a reproducible example:
library(mlr3verse)
library(mlr3spatiotempcv)
task <- tsk("ecuador")
# This example uses the ranger package to fit the model and perform feature filtering
# In order to do this, pipeops need to be used
lrn <- lrn(
"classif.ranger",
num.threads = parallel::detectCores(),
importance = "impurity",
predict_type = "prob"
)
po_lrn <- po("learner", lrn)
# Create feature filter based on variable importance
po_filter <- po("filter", filter = mlr3filters::flt("importance", learner = lrn))
# Create process (new learner) for filtering the task
glrn <- GraphLearner$new(po_filter %>>% po_lrn)
glrn$predict_type <- "prob"
# Create filter parameters
param_set <- ParamSet$new(
params = list(ParamDbl$new("importance.filter.frac", lower = 0.1, upper = 1))
)
# Create filtering instance
instance <- TuningInstanceSingleCrit$new(
task = task,
learner = glrn,
resampling = rsmp("repeated_spcv_coords", folds = 10, repeats = 5),
measure = msr("classif.ce"),
search_space = param_set,
terminator = trm("none")
)
# Create tuner
tuner <- tnr("grid_search", resolution = 10)
tuner$optimize(instance)
I am a novice when it comes to using mlr3 and pipelines, so something in my code might be problematic, but as far as I can see the pipeline is correct. I think the issue comes from the tuning step: when a filter fraction is defined in glrn, the code executes the spatial cross-validation properly:
glrn$param_set$values$importance.filter.frac = 0.3
rr <- resample(task, glrn, rsmp("repeated_spcv_coords", folds = 10, repeats = 5))
So I believe the issue to either be here or in the mlr3tuning package, not sure which so please redirect this issue if necessary. Thanks!
Just as CAST::CreateSpacetimeFolds() does.
I am not sure if this approach can work with all currently implemented spatial sampling methods.
Even if not, we should support exactly this way of creating resamplings since some people already asked me exactly for this.
@HannaMeyer Is there a dedicated name for your method? If not, do you want to make a proposal? :)
You can have a look at the current names of the other methods in the README.
It seems that @jannes-m has already added temporal extension support for spcv-coords.
Let's have a look how this works in detail.
Similar to https://mlr.mlr-org.com/reference/createSpatialResamplingPlots.html.
Should work with all implemented resampling methods.
If only space_var is used, a 2D ggplot can be created similar to other spatial-only methods.
This method should also show the omitted points optionally.
Should TaskClassifST and friends have entries in mlr_reflections$task_types?
rsmp() instead of mlr_resamplings$get()
Dear mlr3spatiotempcv team,
First, many thanks for your hard work on this excellent resource.
I am having an issue producing a checkerboard sampling pattern using spcv_block. Instead of getting a checkerboard spatial partitioning, I always get something that looks more like a random sampling pattern. I have been successful creating a checkerboard pattern using the blockCV functions directly.
Here is a reproducible example that fails to produce a checkerboard sampling pattern:
library(blockCV)
library(mlr3)
library(mlr3spatiotempcv)
x <- runif(5000, -80.5, -75)
y <- runif(5000, 39.7, 42)
data <- data.frame(spp="test",
label=factor(round(runif(length(x), 0, 1))),
x=x,
y=y)
testTask <- TaskClassifST$new(id = "test",
backend = data,
target = "label",
positive="1",
extra_args = list(coordinate_names=c("x", "y"),
crs="EPSG: 4326"))
blockSamp <- rsmp("spcv_block",
folds=2,
range=50000,
selection="checkerboard")
blockSamp$instantiate(testTask)
autoplot(blockSamp, testTask)
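For comparison, the pattern the report expects can be sketched in base R without either package. This only illustrates what a two-fold checkerboard assignment looks like; it is not how blockCV or mlr3spatiotempcv implement it. The helper name `checkerboard_folds` and the block size of 1 (in coordinate units, here degrees) are assumptions made for the example.

```r
# Hypothetical helper: assign each point to one of two folds in a
# checkerboard pattern, based on square blocks of side `block_size`
# (same units as the coordinates).
checkerboard_folds <- function(x, y, block_size) {
  col_idx <- floor(x / block_size)
  row_idx <- floor(y / block_size)
  # Neighbouring blocks alternate between fold 1 and fold 2
  (col_idx + row_idx) %% 2 + 1
}

set.seed(1)
x <- runif(5000, -80.5, -75)
y <- runif(5000, 39.7, 42)
folds <- checkerboard_folds(x, y, block_size = 1)
table(folds)
```

Plotting `x` against `y` coloured by `folds` should show alternating squares, which is the kind of picture the autoplot() call above is expected to produce.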
Prerequisite for spatial stuff: We need a place to store the coordinates in the task. This needs to be enabled in mlr3.
As mentioned here my vision is to make mlr3 THE place for spatial/spatio-temporal resampling methods (there are > 5 methods).
RepeatedResamplingCstf accidentally references the Cluto method.
Move redundant parts into helper funs
Consider using separate help files?
Dear spatio-temporal guys,
first of all, thanks for providing spatial cv via mlr3!!! I am playing around a bit with mlr3spatiotempcv. In the help file of TaskRegrST(), it says that coordinates should be a data.frame when in fact you need to provide a character string indicating the column names of the coordinates found in the backend. A spatio-temporal example would be rather helpful indeed (as already pointed out in #22 and #24). If I can be of any help re the example, pls let me know.
If the target has a binary outcome, a presence-background approach (see blockCV::buffering) would be possible. Target needs to be transformed to 0/1 before sampling.
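The 0/1 transformation mentioned above could look like the following base R sketch. The level names and the choice of "presence" as the positive class are assumptions for illustration only.

```r
# Illustrative data: a two-level factor target
target <- factor(c("presence", "background", "presence", "background"))

# Assumption: "presence" is the positive class and is coded as 1
target01 <- as.integer(target == "presence")
target01
```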
Similar to mlr3::ResamplingRepeatedCV.
spcv-coords
spcv-env
spcv-block
spcv-buffer is LOOCV and has no repeats.
cc @jannes-m
It contains the cookfarm example dataset.
Is this dataset available elsewhere?
https://github.com/envirometrix/landmap was mentioned as a replacement but it is not yet on CRAN.
library(mlr3spatiotemporal)
library(mlr3)
task <- tsk("ecuador")
# Instantiate Resampling
rcv <- rsmp("spcv-block")
rcv$param_set$values <- list(folds = 20)
rcv$instantiate(task)
#> Error in blockCV::spatialBlock(speciesData = points, theRange = self$param_set$values$range, : 'k' is bigger than the number of the blocks
Created on 2019-09-03 by the reprex package (v0.3.0)
GSIF::cookfarm -> Regr Task #22
Though we do not need all 30k obs, just a subset.
Rather than adding it and stopping stating that it is not supported 🙄
I was wondering if there was any planned support for sf dataframes to be used as task inputs? It may remove some of the arguments a user needs to provide when creating a task. A few useful functions that can be used to define some of the spatial inputs include:
# First, load some point data and sf library
data(meuse, package = "sp") # load data.frame from sp
library(sf)
x <- sf::st_as_sf(meuse, coords = c("x", "y"), crs = 28992)
# Generate the coordinate columns
sf::st_coordinates(x)
# Find the names of the coordinate columns
attr(x, "sf_column")
# Extract CRS information of the sf dataframe
sf::st_crs(x)
sf::st_crs(x)$epsg # Gets numeric EPSG code
sf::st_crs(x)$wkt # Gets WKT string
# Ensure that the geometry type is point
sf::st_geometry_type(x, by_geometry = FALSE) # or
all(sf::st_is(x, "POINT"))
# Remove geometry list column for data backend
x_df <- sf::st_drop_geometry(x)
# Keep coordinates as features in the data
cbind(x_df, sf::st_coordinates(x))
No worries if there is no planned support, I'm just curious and offering some solutions just in case! I use the sf package quite a bit in my line of work for extracting raster covariate data, and have used these functions repeatedly. It's just a suggestion for more ease on the user end - rather than providing a dataframe with the x and y coordinates and having to specify the CRS and coordinate names from the dataframe, a user could simply provide the sf dataframe and specify whether to use coordinates as features.
Thanks for all the hard work on this package! I use it frequently and it works really well!
I currently have a task with a column that is a date.
As the task is basically to predict values in the future, a cross-validation strategy that takes this into account would be required, similar to RollingWindowCV.
As this is a very common use-case, we should perhaps think about implementing this.
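A minimal base R sketch of such a time-aware split is shown below, assuming an expanding training window that is always followed by a fixed-size test window. The helper `rolling_window_splits` and its arguments `initial`, `horizon` and `step` are hypothetical names, not part of mlr3 or mlr3spatiotempcv.

```r
# Hypothetical helper: forward-chaining splits over a date column.
# Each fold trains on all observations up to a cut-off and tests on the
# `horizon` observations that immediately follow it in time.
# Assumes initial + horizon <= length(dates).
rolling_window_splits <- function(dates, initial, horizon, step = horizon) {
  ord <- order(dates)                      # row ids sorted by time
  n <- length(ord)
  cutoffs <- seq(initial, n - horizon, by = step)
  lapply(cutoffs, function(end_train) {
    list(
      train = ord[seq_len(end_train)],
      test  = ord[(end_train + 1):(end_train + horizon)]
    )
  })
}

dates <- as.Date("2020-01-01") + 0:9
splits <- rolling_window_splits(dates, initial = 4, horizon = 2)
length(splits)
splits[[1]]
```

Every test set lies strictly after its training set in time, so no future information leaks into training.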
mlr3forecasting, but for forecasting tasks instead of regular Classif|Regr Tasks.
The problem is that the user does not call instantiate() actively in a Graph Learner. Hence, required arguments like time_var for SptCV methods cannot be passed along.
They need to be populated from a field which is set during construction.
If we do so, we possibly can omit the need of arguments in instantiate() completely?
Might work better with spacing between plots and label alignments?
Hi, I am trying to generate a task with my own data as it was shown in some tutorials for mlr. I generate some random data and have a data.frame with coordinates. When I define the task with TaskRegr$new( .... , coordinates = coords), this results in:
Error in .subset2(public_bind_env, "initialize")(...) :
unused argument (coordinates = coords)
Adding coordinates manually to the task does not work either. It would be great to have a full example here, starting from raw data, of a regression problem where one wants to predict y from X and has lat, lon coords to consider in the CV.
Implement k-means clustering after Brenning2012 (https://ieeexplore.ieee.org/document/6352393)
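The idea can be sketched in base R as follows: cluster the coordinates with k-means and use the cluster id as the fold id. This is a sketch in the spirit of the cited approach, not the package's implementation; the helper name `kmeans_folds` and the simulated coordinates are assumptions.

```r
# Hypothetical helper: group observations into `folds` spatial clusters
# with k-means on the coordinates and use the cluster id as fold id.
kmeans_folds <- function(coords, folds) {
  stats::kmeans(coords, centers = folds, iter.max = 50)$cluster
}

set.seed(42)
coords <- cbind(x = runif(200), y = runif(200))
fold_id <- kmeans_folds(coords, folds = 5)

# Fold i uses cluster i as test set and all other points for training
test_ids  <- which(fold_id == 1)
train_ids <- which(fold_id != 1)
table(fold_id)
```

Because k-means starts from random centers, the partition changes between runs unless a seed is set, which is what makes repeated variants of this resampling meaningful.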
as it is basically LOOCV
see #20
By passing down the ellipsis args of autoplot().
To match the naming scheme of mlr3 which uses underscores (e.g. repeated_cv).
Currently mlr3spatiotempcv overwrites col_roles and row_roles in zzz.R.
This makes it incompatible with other packages, i.e. if I load a package that adds a different set of col_roles, those are overwritten by mlr3spatiotempcv. We should append instead of overwriting here.
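The difference between overwriting and appending can be illustrated with a plain list standing in for the registry. The list `registry` and all role names below are illustrative assumptions, not the actual contents of mlr_reflections.

```r
# Stand-in for a registry such as mlr_reflections; the role names
# below are illustrative only.
registry <- list(col_roles = c("feature", "target", "other_pkg_role"))

# Roles a package wants to register on load (hypothetical names)
new_roles <- c("feature", "target", "coordinates")

# Overwriting would drop roles added by other packages:
#   registry$col_roles <- new_roles
# Appending keeps them and avoids duplicates:
registry$col_roles <- union(registry$col_roles, new_roles)
registry$col_roles
```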
The fact that the task_type is non-unique after loading mlr3spatiotempcv leads to tiny problems in mlr3pipelines. We should discuss at a higher level how we expect packages in the mlr3verse to behave here, as it is not completely clear how things should work.
We will find a workaround in mlr3pipelines for now.
Currently, the error messages raised when using a wrong task with a spatial resampling method do not really help with finding the error; they could be more descriptive.
library(mlr3); library(mlr3spatiotempcv);rsmp("spcv_coords")$instantiate(tsk("boston_housing"))
#> Error in as.matrix(x): attempt to apply non-function
Created on 2020-09-14 by the reprex package (v0.3.0)
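One way to get a more descriptive message is an explicit check before the resampling touches the coordinates. The sketch below uses a plain list in place of an mlr3 Task, and the helper `assert_has_coordinates` and its message wording are hypothetical.

```r
# Hypothetical check: fail early with a clear message when a task
# lacks coordinates. `task` is a plain list here, not an mlr3 Task.
assert_has_coordinates <- function(task) {
  if (is.null(task$coordinates)) {
    stop(sprintf(
      "Task '%s' has no coordinates; spatial resampling needs a spatial task.",
      task$id
    ), call. = FALSE)
  }
  invisible(task)
}

msg <- tryCatch(
  assert_has_coordinates(list(id = "boston_housing")),
  error = function(e) conditionMessage(e)
)
msg
```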
I want to evaluate the performance of hyperparameters on a spatial dataset within spatial CV. Unfortunately, while non-spatial CV or bootstrap work in the mlr3tuning instance, spcv-coords and repeated-spcv-coords produce the error:
Error in benchmark_grid(self$task, learners, self$resampling) (mlr3_issue.R#43): Resampling is instantiated for a task with a different number of observations
I'm not sure whether the error occurred in my code (or in the idea of spatially cross-validating hyperparameters) or whether the feature is not provided by mlr3spatiotempcv. However, even if I instantiate the resampling method on the task (tuning_resampling$instantiate(task)), the tuning instance reproduces the error.
A reproducible example from the ecuador dataset:
library("mlr3")
library("mlr3spatiotempcv")
library("mlr3tuning")
library("paradox")
task = tsk("ecuador")
learner = lrn("classif.rpart", predict_type = "prob")
# tune hyperparameter cp
param_set = ps(cp = p_dbl(lower = -5, upper = 0, trafo = function(x) 10^x))
# AUROC suitable for binary classification tasks
measure = msr("classif.auc")
# 10 evaluations
terminator = trm("evals", n_evals = 10)
# random search: best balance between computation time and search space grazing
tuner = tnr("random_search")
# inner resampling method
tuning_resampling = rsmp("spcv-coords", folds = 10)
# tuning_resampling$instantiate(task)
instance = TuningInstanceSingleCrit$new(
task = task,
learner = learner,
resampling = tuning_resampling,
measure = measure,
search_space = param_set,
terminator = terminator
)
tuner$optimize(instance)
#' Error in benchmark_grid(self$task, learners, self$resampling) (mlr3_issue.R#43):
#' Resampling is instantiated for a task with a different number of observations
From package blockCV: besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.13107
Whereas it should only plot train and test points of the respective plot.
Omitted points should be plotted optionally using a different color.
Maybe via a flag in spcv-block.
Dear mlr3 team,
first of all, thanks for your efforts in developing this extension package, it is very much appreciated.
I am trying to apply spatial CV using "spcv_coords" to an AutoTuner in order to perform nested resampling, following the process described in the mlr3 book:
RT.at_sp <- AutoTuner$new(
learner = reg.tree,
resampling = spatial_CV,
measure = opt.mse,
search_space = param_set_RT,
terminator = trm.evals,
tuner = tnr.GridSearch)
However, I end up with the error message:
"Error: Resampling 'spcv_coords' may not be instantiated".
The same error message remains, even if I try to instantiate the task manually beforehand using the command
spatial_CV$instantiate(sp_task)
as described in 2.5.2.
As I am not an expert, am I doing something wrong, or is spatial CV not yet implemented for use with AutoTuner?
Thank you very much!
BR, Jürgen
Hi Patrick,
is there any reason why you restrict the class of the target column to either factor or character?
I have an ordered factor as response, which fails in task creation with TaskClassifST$new(), but it works with TaskClassif$new().
mlr3spatiotempcv/R/TaskClassifST.R
Line 70 in 582d6f0
See also this issue, where the question was raised for mlr3:
mlr-org/mlr3#95
Here's a little reprex:
library(mlr3verse)
#> Loading required package: mlr3
#> Loading required package: mlr3filters
#> Loading required package: mlr3learners
#> Loading required package: mlr3pipelines
#> Loading required package: mlr3tuning
#> Loading required package: mlr3viz
#> Loading required package: paradox
# remotes::install_github("mlr-org/mlr3spatiotempcv")
library(mlr3spatiotempcv)
brew <- mapview::breweries
brew$number.seasonal.beers <- factor(brew$number.seasonal.beers, ordered = TRUE)
brew <- cbind(sf::st_drop_geometry(brew),
sf::st_coordinates(brew))
# task
task_nsb_o = TaskClassif$new(
id = "nsb",
backend = brew,
target = "number.seasonal.beers")
task_nsb_o
#> <TaskClassif:nsb> (224 x 10)
#> * Target: number.seasonal.beers
#> * Properties: multiclass
#> * Features (9):
#> - dbl (4): X, Y, founded, number.of.types
#> - chr (4): address, brewery, village, zipcode
#> - fct (1): state
# task ST - ordered
task_nsb_ST_o = TaskClassifST$new(
id = "nsbST_o",
backend = brew,
target = "number.seasonal.beers",
extra_args = list(
coordinate_names = c("X", "Y"),
coords_as_features = FALSE,
crs = "4326"))
#> Error: Target column 'number.seasonal.beers' must be a factor or character
# task ST - factor
brew$number.seasonal.beers <- factor(brew$number.seasonal.beers, ordered = FALSE)
task_nsb_ST_f = TaskClassifST$new(
id = "nsbST_f",
backend = brew,
target = "number.seasonal.beers",
extra_args = list(
coordinate_names = c("X", "Y"),
coords_as_features = FALSE,
crs = "4326"))
task_nsb_ST_f
#> <TaskClassifST:nsbST_f> (224 x 8)
#> * Target: number.seasonal.beers
#> * Properties: multiclass
#> * Features (7):
#> - chr (4): address, brewery, village, zipcode
#> - dbl (2): founded, number.of.types
#> - fct (1): state
#> * Coordinates:
#> X Y
#> 1: 10.88922 49.71979
#> 2: 11.23873 50.12579
#> 3: 10.85194 49.42080
#> 4: 10.07837 50.16197
#> 5: 9.97323 49.97720
#> ---
#> 220: 10.93073 50.12684
#> 221: 11.54562 50.07220
#> 222: 11.50372 50.01548
#> 223: 11.55831 49.98518
#> 224: 11.07389 50.06172
Created on 2020-11-09 by the reprex package (v0.3.0)