This function assigns modelling input data to spatial-block cross-validation folds using three strategies (see below), relying on blockCV::cv_spatial. The function is intended to be used inside the Mod_Prep4HPC function.

Usage

Mod_GetCV(
  Data = NULL,
  EnvFile = ".env",
  XVars = NULL,
  CV_NFolds = 4L,
  CV_NGrids = 20L,
  CV_NR = 2L,
  CV_NC = 2L,
  CV_SAC = FALSE,
  OutPath = NULL,
  CV_Plot = TRUE
)

Arguments

Data

data.frame. A data frame or tibble containing the input dataset. It should include two columns for the x and y coordinates, as well as columns matching the names of the predictors listed in the XVars argument. This argument is mandatory and cannot be empty.

EnvFile

Character. Path to the environment file containing paths to data sources. Defaults to .env.

XVars

Character vector. Variables to be used in the model. This argument is mandatory and cannot be empty.

CV_NFolds

Integer. Number of cross-validation folds. Default: 4L.

CV_NGrids

Integer. Number of grid cells in both directions used in the CV_Dist cross-validation strategy (see below). Default: 20L.

CV_NR, CV_NC

Integer. Number of rows and columns used in the CV_Large cross-validation strategy (see below), in which the study area is divided into large blocks according to the provided CV_NR and CV_NC values. Both default to 2L, which splits the study area into four large blocks at the median latitude and longitude.

CV_SAC

Logical. Whether to use spatial autocorrelation to determine the block size. Defaults to FALSE.

OutPath

Character. Path to the directory in which to save the cross-validation results. This argument is mandatory and cannot be empty.

CV_Plot

Logical. Whether to plot the block cross-validation folds. Defaults to TRUE.

Value

The function returns a modified version of the input dataset with additional integer columns, one for each cross-validation strategy used, indicating the fold to which each row is assigned.
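As a rough illustration of how such fold columns might be used downstream, the sketch below holds out each fold in turn. The data frame, its values, and the column name `CV_Large` are assumptions made for this example only; check the actual column names in the object returned by Mod_GetCV.

```r
# Hypothetical output: the column name `CV_Large` and all values are assumptions
CV_Data <- data.frame(
  x = c(10.5, 11.5, 12.5, 13.5, 14.5, 15.5),
  y = c(50.5, 50.5, 51.5, 51.5, 52.5, 52.5),
  CV_Large = c(1L, 2L, 3L, 4L, 1L, 2L)
)

# Hold out each fold in turn; the remaining folds form the training set
Folds <- sort(unique(CV_Data$CV_Large))
Splits <- lapply(Folds, function(Fold) {
  list(
    Train = CV_Data[CV_Data$CV_Large != Fold, ],
    Test  = CV_Data[CV_Data$CV_Large == Fold, ]
  )
})

# Number of held-out rows per fold
TestSizes <- vapply(Splits, function(S) nrow(S$Test), integer(1))
print(TestSizes)
```

Each element of `Splits` pairs a training set with a spatially separated test set, which is the point of block cross-validation.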

Note

The function uses the following cross-validation strategies:

  • CV_Dist in which the size of spatial cross-validation blocks is determined by the CV_NGrids argument. The default CV_NGrids value is 20L, which means blocks of 20×20 grid cells each.

  • CV_Large which splits the study area into large blocks, as determined by the CV_NR and CV_NC arguments. If CV_NR = CV_NC = 2L (default), four large blocks are used, splitting the study area at the median coordinates.

  • CV_SAC in which the size of the blocks is determined by the median spatial autocorrelation range in the predictor data (estimated using blockCV::cv_spatial_autocor). This requires the availability of the automap R package. This strategy is currently skipped by default.
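The block geometry behind the first two strategies can be sketched in base R. This is an illustration of the idea only, not the function's implementation, which delegates block construction and fold assignment to blockCV::cv_spatial. All object and column names below (`Data`, `Cells`, `Block`, `Row`, `Col`) and the toy coordinates are assumptions.

```r
# --- CV_Large idea: split at the median coordinates into 2 x 2 blocks ---
# Hypothetical coordinates of 8 grid cells
Data <- data.frame(
  x = c(1, 2, 3, 4, 1, 2, 3, 4),
  y = c(1, 1, 1, 1, 2, 2, 2, 2)
)

East  <- Data$x > stats::median(Data$x)  # FALSE = west half, TRUE = east half
North <- Data$y > stats::median(Data$y)  # FALSE = south half, TRUE = north half

# Combine the two splits into a block ID in 1..4
Data$Block <- 1L + as.integer(East) + 2L * as.integer(North)

# --- CV_Dist idea: group grid cells into blocks of CV_NGrids x CV_NGrids ---
# Hypothetical row/column indices of cells on a 40 x 40 grid
Cells <- expand.grid(Col = 1:40, Row = 1:40)
CV_NGrids <- 20L

BlockCol <- (Cells$Col - 1L) %/% CV_NGrids  # 0-based block column
BlockRow <- (Cells$Row - 1L) %/% CV_NGrids  # 0-based block row
Cells$Block <- 1L + BlockCol + BlockRow * (max(BlockCol) + 1L)

# Cells per block in each toy example
print(table(Data$Block))
print(table(Cells$Block))
```

In both toy cases the result is a block ID per cell; how blocks are then allocated to the CV_NFolds folds is left to blockCV::cv_spatial and is not reproduced here.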

Author

Ahmed El-Gabbas