
This function assigns modelling input data to spatial-block cross-validation folds using blockCV::cv_spatial. Three blocking strategies are available (see below). The function is intended to be called from inside the Mod_Prep4HPC function.

Usage

GetCV(
  DT,
  EnvFile = ".env",
  XVars,
  CV_NFolds = 4,
  CV_NGrids = 20,
  CV_NR = 2,
  CV_NC = 2,
  CV_SAC = FALSE,
  OutPath = NULL,
  FromHPC = TRUE,
  CV_Plot = TRUE
)

Arguments

DT

A data frame or tibble containing the input dataset. It should include two columns for the x and y coordinates, as well as columns whose names match the predictors listed in the XVars argument.

EnvFile

String specifying the path to read environment variables from, with a default value of .env.

XVars

Vector of strings specifying the variables to be used in the model. This argument is mandatory and cannot be empty.

CV_NFolds

Number of cross-validation folds. Default: 4.

CV_NGrids

For the CV_Dist cross-validation strategy (see below), this argument determines the size of the blocks, expressed as the number of grid cells in each direction. Default: 20.

CV_NR, CV_NC

Integer. The number of rows and columns used in the CV_Large cross-validation strategy (see below), in which the study area is divided into large blocks according to the CV_NR and CV_NC values. Both default to 2, which splits the study area into four large blocks at the median latitude and longitude.

CV_SAC

Logical. Indicates whether to use spatial autocorrelation to determine the block size. Default: FALSE.

OutPath

String specifying the folder path to save the cross-validation results. Default: NULL.

FromHPC

Logical. Indicates whether the function is being run in an HPC environment, which affects file path handling. Default: TRUE.

CV_Plot

Logical. Indicates whether to plot the block cross-validation folds. Default: TRUE.
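
As an illustration of the arguments above, the following is a minimal sketch of preparing an input table and calling the function. The coordinate column names (x and y), the predictor names (bio1, bio12), and the coordinate ranges are assumptions made for this example only; a real call also needs a valid .env file and the reference grid data, so the call is guarded and runs only when GetCV is available in the session.

```r
# Hypothetical input: coordinates plus two predictor columns
# (column names and value ranges are assumptions for illustration)
set.seed(42)
DT <- data.frame(
  x = runif(50, 0, 1000),
  y = runif(50, 0, 1000),
  bio1 = rnorm(50),
  bio12 = rnorm(50)
)
XVars <- c("bio1", "bio12")

# XVars must be non-empty and every name must be a column of DT
stopifnot(length(XVars) > 0, all(XVars %in% names(DT)))

# Call the function only if it is available in the current session
if (exists("GetCV")) {
  DT_CV <- GetCV(
    DT = DT, EnvFile = ".env", XVars = XVars,
    CV_NFolds = 4, FromHPC = FALSE, CV_Plot = FALSE
  )
}
```

Checking XVars against the column names before the call surfaces a mismatch early, rather than inside the cross-validation step.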

Value

The function returns a modified version of the input dataset DT with three additional integer columns indicating the cross-validation folds:

  1. CV_SAC in which the size of the blocks is determined by the median spatial autocorrelation range in the predictor data (estimated using blockCV::cv_spatial_autocor). This requires the availability of the automap R package.

  2. CV_Dist in which the size of spatial cross-validation blocks is determined by the CV_NGrids argument. The default CV_NGrids value is 20, which means blocks of 20×20 grid cells each.

  3. CV_Large which splits the study area into large blocks, as determined by the CV_NR and CV_NC arguments. If CV_NR = CV_NC = 2 (default), four large blocks are used, splitting the study area at the median coordinates.
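
The idea behind the CV_Large strategy can be sketched in base R. This is not the function's implementation (which relies on blockCV::cv_spatial); it only illustrates splitting a set of points into CV_NR × CV_NC blocks at evenly spaced coordinate quantiles, which for the default 2 × 2 case means the median. The data, the column names, and the CV_Large_sketch column are hypothetical.

```r
# Hypothetical point data (column names x and y are assumptions)
set.seed(1)
pts <- data.frame(x = runif(200, 0, 100), y = runif(200, 0, 50))

CV_NR <- 2  # number of block rows
CV_NC <- 2  # number of block columns

# Break points at evenly spaced quantiles; with 2 rows/columns the single
# interior break is the median coordinate of the points
x_breaks <- quantile(pts$x, probs = seq(0, 1, length.out = CV_NC + 1))
y_breaks <- quantile(pts$y, probs = seq(0, 1, length.out = CV_NR + 1))

# Column and row index of each point
col_id <- cut(pts$x, breaks = x_breaks, include.lowest = TRUE, labels = FALSE)
row_id <- cut(pts$y, breaks = y_breaks, include.lowest = TRUE, labels = FALSE)

# One integer block ID per point, from 1 to CV_NR * CV_NC
pts$CV_Large_sketch <- as.integer((row_id - 1) * CV_NC + col_id)

# Number of points per block
table(pts$CV_Large_sketch)
```

With four blocks and four folds, each block can serve as one fold. Note that this sketch takes the medians of the points themselves, whereas the documented function splits the study area.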

Details

The function reads the following environment variables:

  • DP_R_Grid (if FromHPC = TRUE) or DP_R_Grid_Local (if FromHPC = FALSE): path from which the function reads the content of the Grid_10_Land_Crop.RData file.

  • DP_R_EUBound_sf (if FromHPC = TRUE) or DP_R_EUBound_sf_Local (if FromHPC = FALSE): path for the RData file containing the country boundaries (sf object).
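
The selection between the HPC and local variables can be sketched as follows. This is a minimal illustration that assumes the .env file uses plain KEY=VALUE lines and loads it with base R's readRenviron; the function itself may read the file differently. The paths written to the temporary file are hypothetical.

```r
# Write a throwaway .env file with hypothetical paths (illustration only)
env_file <- tempfile(fileext = ".env")
writeLines(
  c(
    "DP_R_Grid=/hpc/project/data/grid",
    "DP_R_Grid_Local=/local/project/data/grid"
  ),
  env_file
)

# Load the KEY=VALUE pairs into the current R session
readRenviron(env_file)

# Choose the variable name according to where the code runs
FromHPC <- FALSE
grid_var <- if (FromHPC) "DP_R_Grid" else "DP_R_Grid_Local"
grid_dir <- Sys.getenv(grid_var, unset = NA)

if (is.na(grid_dir)) {
  stop("Environment variable not set: ", grid_var)
}

# Full path to the grid file named in the documentation
grid_file <- file.path(grid_dir, "Grid_10_Land_Crop.RData")
grid_file
```

Failing early when the variable is unset gives a clearer error than a later attempt to load a file from an empty path.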

Author

Ahmed El-Gabbas