This function prepares initial models in R for use with Hmsc-HPC. It covers data preparation, definition of spatial block cross-validation folds, model initialization, generation of Gaussian Predictive Process (GPP) knots, and creation of commands for HPC execution. It supports parallel processing and can prepare models with or without phylogenetic tree data. The models will be fitted using GPP (see Tikhonov et al. for more details) via the Hmsc-HPC extension.


  Hab_Abb = NULL,
  Path_Model = NULL,
  MinEffortsSp = 100L,
  PresPerSpecies = 80L,
  EnvFile = ".env",
  GPP_Dists = NULL,
  GPP_Save = TRUE,
  GPP_Plot = TRUE,
  MinLF = NULL,
  MaxLF = NULL,
  Alphapw = list(Prior = NULL, Min = 20, Max = 1200, Samples = 200),
  BioVars = c("bio3", "bio4", "bio11", "bio18", "bio19", "npp"),
  QuadraticVars = BioVars,
  EffortsAsPredictor = TRUE,
  RoadRailAsPredictor = TRUE,
  HabAsPredictor = TRUE,
  RiversAsPredictor = TRUE,
  NspPerGrid = 0L,
  ExcludeCult = TRUE,
  ExcludeZeroHabitat = TRUE,
  CV_NFolds = 4L,
  CV_NGrids = 20L,
  CV_NR = 2L,
  CV_NC = 2L,
  CV_Plot = TRUE,
  PhyloTree = TRUE,
  NoPhyloTree = FALSE,
  OverwriteRDS = TRUE,
  NCores = 8L,
  NChains = 4L,
  thin = NULL,
  samples = 1000L,
  transientFactor = 500L,
  verbose = 200L,
  SkipFitted = TRUE,
  NumArrayJobs = 210L,
  ModelCountry = NULL,
  VerboseProgress = TRUE,
  FromHPC = TRUE,
  MemPerCpu = NULL,
  Time = NULL,
  JobName = NULL,
  Path_Hmsc = NULL,
  CheckPython = FALSE,
  Precision = 64,



Hab_Abb

Character. Abbreviation for the habitat type (based on SynHab) for which to prepare data. Valid values are 0, 1, 2, 3, 4a, 4b, 10, 12a, 12b. If Hab_Abb = 0, data is prepared irrespective of the habitat type. For more details, see Pysek et al.


Path_Model

String (without trailing slash) specifying the path where all output, including models to be fitted, will be saved.


MinEffortsSp

Integer specifying the minimum number of vascular plant species per grid cell (from GBIF data) required for inclusion in the models. This excludes grid cells with very low sampling effort. Defaults to 100.


PresPerSpecies

Integer. The minimum number of presence grid cells for a species to be included in the analysis. The number of presence grid cells per species is calculated after discarding grid cells with low sampling efforts (MinEffortsSp). Defaults to 80.


EnvFile

Character. Path to the environment file containing paths to data sources. Defaults to .env.


Logical indicating whether to fit a spatial random effect using a Gaussian Predictive Process (GPP). Defaults to TRUE. If FALSE, non-spatial models will be fitted.


GPP_Dists

Integer specifying the distance in kilometers used both for the spacing between knots and the minimum allowable distance between a knot and the nearest sampling point. The GPP knots are prepared by the PrepKnots function. The same value will be used for the knotDist and minKnotDist arguments of the Hmsc::constructKnots function.


GPP_Save

Logical indicating whether to save the resulting knots as an RData file. Default: TRUE.


GPP_Plot

Logical indicating whether to plot the coordinates of the sampling units and the knots in a PDF file. Default: TRUE.

MinLF, MaxLF

Integer. Minimum and maximum number of latent factors to be used. Both default to NULL, which means that the number of latent factors will be estimated from the data. If either is provided, the respective value will be passed to Hmsc::setPriors.


Alphapw

Prior for the alpha parameter. Defaults to a list with Prior = NULL, Min = 20, Max = 1200, and Samples = 200. If Alphapw is NULL or a list in which all items are NULL, the default prior will be used. If Prior is a matrix, it will be used as the prior. If Prior is NULL, the prior will be generated using Min, Max, and Samples. Min and Max are the minimum and maximum values of the alpha parameter (in kilometers). Samples is the number of samples to be used in the prior.
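
As an illustration of how a prior could be generated from Min, Max, and Samples, the sketch below builds a two-column grid of alpha values and prior weights, which is the layout used by the alphapw argument of Hmsc::setPriors. The helper name make_alpha_prior and the equal weighting are assumptions made for this example; the function itself may weight the grid differently (for example by including alpha = 0).

```r
# Hypothetical helper (not part of the package): build a discrete prior grid
# for alpha from Min, Max, and Samples, giving every grid value equal weight.
make_alpha_prior <- function(Min = 20, Max = 1200, Samples = 200) {
  alpha <- seq(Min, Max, length.out = Samples)  # candidate alpha values (km)
  weight <- rep(1 / Samples, Samples)           # assumed uniform prior weights
  cbind(alpha = alpha, weight = weight)
}

prior <- make_alpha_prior()
dim(prior)
head(prior, 3)
```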


BioVars

Character vector. Specifies variables from CHELSA to be used in the model. This can include bioclimatic variables (bio1-19) as well as other predictors such as npp (Net Primary Productivity). Defaults to 6 ecologically meaningful and less correlated variables: c("bio3", "bio4", "bio11", "bio18", "bio19", "npp").


QuadraticVars

Character vector of variables for which quadratic terms are used. Defaults to all variables in BioVars. If QuadraticVars is NULL, no quadratic terms will be used.


EffortsAsPredictor

Logical indicating whether to include the (log10) sampling efforts as a predictor in the model. Default: TRUE.


RoadRailAsPredictor

Logical indicating whether to include the (log10) sum of road and railway intensity as a predictor in the model. Default: TRUE.


HabAsPredictor

Logical indicating whether to include the (log10) percentage coverage of the respective habitat type per grid cell as a predictor in the model. Default: TRUE. Only valid if Hab_Abb is not 0.


RiversAsPredictor

Logical indicating whether to include the total length of rivers per grid cell as a predictor in the model. Default: TRUE. See River_Length for more details.


NspPerGrid

Integer indicating the minimum number of species per grid cell for a grid cell to be included in the analysis. This is calculated after filtering grid cells by sampling efforts (MinEffortsSp) and filtering species by the number of presence grid cells (PresPerSpecies). If NspPerGrid = 0 (default), all grid cells will be used in the models. If NspPerGrid > 0, only grid cells with at least NspPerGrid species present will be considered in the models.
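
The order of the three filters (MinEffortsSp, then PresPerSpecies, then NspPerGrid) can be sketched on a toy presence-absence matrix. This is only a minimal sketch: the helper filter_cells_species, the toy data, and the use of >= for every threshold are assumptions for illustration, not the package's internal code.

```r
# Toy presence-absence matrix: 6 grid cells (rows) x 4 species (columns)
PA <- matrix(
  c(1, 1, 0, 0,
    1, 1, 1, 0,
    1, 0, 1, 0,
    1, 1, 1, 1,
    0, 0, 0, 1,
    1, 1, 0, 0),
  ncol = 4, byrow = TRUE,
  dimnames = list(paste0("cell", 1:6), paste0("sp", 1:4))
)
# Hypothetical sampling effort per grid cell (number of species in GBIF)
effort <- c(150, 120, 90, 200, 300, 110)

# Hypothetical helper applying the three filters in the documented order
filter_cells_species <- function(PA, effort, MinEffortsSp, PresPerSpecies,
                                 NspPerGrid = 0) {
  # 1. Drop grid cells with low sampling effort
  PA <- PA[effort >= MinEffortsSp, , drop = FALSE]
  # 2. Drop species with too few presence grid cells among the remaining cells
  PA <- PA[, colSums(PA) >= PresPerSpecies, drop = FALSE]
  # 3. Optionally drop grid cells with too few of the remaining species
  if (NspPerGrid > 0) {
    PA <- PA[rowSums(PA) >= NspPerGrid, , drop = FALSE]
  }
  PA
}

res <- filter_cells_species(
  PA, effort, MinEffortsSp = 100, PresPerSpecies = 3, NspPerGrid = 2
)
res
```

Small thresholds are used here only so that the toy data survive the filters; the documented defaults are 100, 80, and 0.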


ExcludeCult

Logical. Indicates whether to exclude countries with cultivated or casual observations per species. Defaults to TRUE.


ExcludeZeroHabitat

Logical. Indicates whether to exclude grid cells with zero habitat coverage. Defaults to TRUE.


CV_NFolds

Number of cross-validation folds. Default: 4.


CV_NGrids

Integer. For the CV_Dist cross-validation strategy (see below), this argument determines the size of the blocks (number of grid cells in each direction). Default: 20.


CV_NR, CV_NC

Integer. The number of rows and columns used in the CV_Large cross-validation strategy (see below), in which the study area is divided into large blocks according to the provided CV_NR and CV_NC values. Both default to 2, which splits the study area into four large blocks at the median latitude and longitude.
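
A minimal sketch of the idea behind CV_Large follows: grid cells are assigned to CV_NR x CV_NC blocks by cutting their coordinates at equally spaced quantiles, which reduces to the median when both values are 2. The helper assign_large_blocks, the toy coordinates, and the use of quantiles for values other than 2 are assumptions for illustration only.

```r
# Hypothetical helper: assign each grid cell to one of CV_NR x CV_NC blocks
assign_large_blocks <- function(x, y, CV_NR = 2L, CV_NC = 2L) {
  # Break points at equally spaced quantiles (the median when the value is 2)
  col_breaks <- quantile(x, probs = seq(0, 1, length.out = CV_NC + 1))
  row_breaks <- quantile(y, probs = seq(0, 1, length.out = CV_NR + 1))
  col_id <- cut(x, breaks = col_breaks, include.lowest = TRUE, labels = FALSE)
  row_id <- cut(y, breaks = row_breaks, include.lowest = TRUE, labels = FALSE)
  # Combine row and column index into a single block identifier
  (row_id - 1L) * CV_NC + col_id
}

# Toy study area: a regular 4 x 4 grid of cell centres
grid <- expand.grid(x = 1:4, y = 1:4)
grid$block <- assign_large_blocks(grid$x, grid$y)
table(grid$block)
```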


CV_Plot

Logical indicating whether to plot the block cross-validation folds. Default: TRUE.


Logical indicating whether to use spatial autocorrelation to determine the block size. Defaults to FALSE.

PhyloTree, NoPhyloTree

Logical parameters indicating whether to fit models with (PhyloTree) or without (NoPhyloTree) phylogenetic trees. Defaults are PhyloTree = TRUE and NoPhyloTree = FALSE, meaning only models with phylogenetic trees are fitted by default. At least one of PhyloTree and NoPhyloTree should be TRUE.
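
The constraint that at least one of the two flags is TRUE could be checked as in the sketch below. The helper check_tree_options and the labels "Tree" and "NoTree" are hypothetical names used only for this example.

```r
# Hypothetical helper: validate the two flags and list the model variants
# that would be prepared
check_tree_options <- function(PhyloTree = TRUE, NoPhyloTree = FALSE) {
  if (!isTRUE(PhyloTree) && !isTRUE(NoPhyloTree)) {
    stop("At least one of PhyloTree and NoPhyloTree should be TRUE")
  }
  c("Tree", "NoTree")[c(PhyloTree, NoPhyloTree)]
}

check_tree_options()                                     # defaults
check_tree_options(PhyloTree = TRUE, NoPhyloTree = TRUE) # both variants
```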


OverwriteRDS

Logical indicating whether to overwrite previously exported RDS files for initial models. Default: TRUE.


NCores

Integer specifying the number of cores to use for parallel processing. Default: 8.


NChains

Integer specifying the number of model chains. Default: 4.


thin

Integer specifying the value(s) for thinning in MCMC sampling. If more than one value is provided, a separate model will be fitted at each value of thinning.


samples

Integer specifying the value(s) for the number of MCMC samples. If more than one value is provided, a separate model will be fitted for each number of samples. Defaults to 1000.


transientFactor

Integer specifying the transient multiplication factor. The value of transient equals transientFactor multiplied by thin. Default: 500.
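
The relation between thin, samples, and transientFactor can be worked through with a short calculation. The thin values below are example inputs, and crossing every thin value with every samples value is an assumption about how multiple values are combined. The total number of iterations per chain follows the usual Hmsc convention of transient + samples * thin.

```r
# Example inputs (thin has no default in the usage, so these are illustrative)
thin <- c(5L, 10L)
samples <- 1000L
transientFactor <- 500L

# One row per candidate model setting (assumed: all combinations)
settings <- expand.grid(thin = thin, samples = samples)

# transient = transientFactor * thin
settings$transient <- transientFactor * settings$thin

# Total MCMC iterations per chain: burn-in plus thinned sampling
settings$total_iterations <- settings$transient + settings$samples * settings$thin
settings
```

With thin = 5 the transient is 2500 iterations and each chain runs 7500 iterations in total; with thin = 10 these become 5000 and 15000.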


verbose

Integer indicating the interval at which MCMC sampling progress is reported. Default: 200.


SkipFitted

Logical indicating whether to skip already fitted models. Default: TRUE.


NumArrayJobs

Integer specifying the maximum number of array jobs per SLURM script. Default: 210. See LUMI documentation for more details.


ModelCountry

String or vector of strings specifying the country or countries to filter observations by. Default: NULL, which means that data are prepared for the whole of Europe.


VerboseProgress

Logical. Indicates whether progress messages should be displayed. Defaults to TRUE.


FromHPC

Logical indicating whether the work is being done from HPC, to adjust file paths accordingly. Default: TRUE.


PrepSLURM

Logical indicating whether to prepare SLURM command files. If TRUE (default), the SLURM commands will be saved to disk using the Mod_SLURM function.


MemPerCpu

String specifying the memory per CPU for the SLURM job. This value will be assigned to the #SBATCH --mem-per-cpu= SLURM argument. Example: "32G" to request 32 gigabytes. Only effective if PrepSLURM = TRUE.


Time

String specifying the requested time for each job in the SLURM bash arrays. Example: "01:00:00" to request an hour. Only effective if PrepSLURM = TRUE.


JobName

String specifying the name of the submitted job(s) for SLURM. If NULL (default), the job name will be prepared based on the folder path and the Hab_Abb value. Only effective if PrepSLURM = TRUE.
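
To show how MemPerCpu, Time, JobName, and NumArrayJobs could map onto a SLURM script, the sketch below assembles a few #SBATCH header lines. The helper slurm_header and the example job name are hypothetical; the actual script is written by Mod_SLURM and may contain other directives. Only standard sbatch options (--job-name, --mem-per-cpu, --time, --array) are used.

```r
# Hypothetical helper: build SLURM header lines from the documented arguments
slurm_header <- function(JobName, MemPerCpu, Time, NumArrayJobs = 210L) {
  c(
    "#!/bin/bash",
    paste0("#SBATCH --job-name=", JobName),
    paste0("#SBATCH --mem-per-cpu=", MemPerCpu),
    paste0("#SBATCH --time=", Time),
    paste0("#SBATCH --array=1-", NumArrayJobs)
  )
}

hdr <- slurm_header(
  JobName = "Hab1_models",  # illustrative name only
  MemPerCpu = "32G",
  Time = "01:00:00"
)
cat(hdr, sep = "\n")
```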


Path_Hmsc

String specifying the path to Hmsc-HPC. This will be provided as the Path_Hmsc argument of the Mod_SLURM function.


CheckPython

Logical indicating whether to check if the Python executable exists. Only valid if FromHPC = FALSE.


Logical indicating whether to convert unfitted models to JSON before saving to RDS file. Default: FALSE.


Precision

Integer, either 32 or 64, for the precision mode used for sampling while fitting Hmsc-HPC models (passed as --fp 32 or --fp 64). Defaults to 64, which is also the default in Hmsc-HPC. This option is still under testing.


Additional parameters provided to the Mod_SLURM function.


The function is used for its side effects of preparing data and models for HPC and does not return any value.


The function provides options for:

  • for which habitat types the models will be fitted

  • excluding grid cells with very low sampling efforts (MinEffortsSp)

  • selecting species based on a minimum number of presence grid cells (PresPerSpecies)

  • optionally fitting models on a specified list of countries (ModelCountry)

  • whether to exclude grid cells with few species (NspPerGrid)

  • number of cross-validation folds

  • whether to include phylogenetic information in the model

  • different values for knot distance for GPP (GPP_Dists)

  • which bioclimatic variables are used in the models (BioVars)

  • whether to include sampling efforts (EffortsAsPredictor), percentage of the respective habitat type per grid cell (HabAsPredictor), and railway and road intensity per grid cell (RoadRailAsPredictor) as predictors

  • Hmsc options (NChains, thin, samples, transientFactor, and verbose)

  • whether to prepare SLURM commands (PrepSLURM) and some of their specifications (e.g. NumArrayJobs, MemPerCpu, Time, JobName)

    The function reads the following environment variables:

    • DP_R_Grid (if FromHPC = TRUE) or DP_R_Grid_Local (if FromHPC = FALSE). The function reads the content of the Grid_10_Land_Crop.RData file from this path.

    • DP_R_TaxaInfo or DP_R_TaxaInfo_Local for the location of the Species_List_ID.txt file representing species information.

    • DP_R_EUBound_sf or DP_R_EUBound_sf_Local for the path of the RData file containing the country boundaries (sf object)

    • DP_R_PA or DP_R_PA_Local: The function reads the contents of the Sp_PA_Summary_DF.RData file from this path
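
The choice between the HPC and local variants of these variables can be sketched as follows. This assumes the variables defined in EnvFile have already been loaded into the R session; the helper env_var_name and the example directory are hypothetical and serve only to illustrate the naming pattern.

```r
# Hypothetical helper: pick the variable name that matches FromHPC
env_var_name <- function(base, FromHPC = TRUE) {
  if (FromHPC) base else paste0(base, "_Local")
}

# Illustrative value only; in practice this comes from the environment file
Sys.setenv(DP_R_Grid_Local = "/hypothetical/local/grid")

# Resolve the grid directory for a local (non-HPC) session
grid_dir <- Sys.getenv(env_var_name("DP_R_Grid", FromHPC = FALSE))
grid_file <- file.path(grid_dir, "Grid_10_Land_Crop.RData")
grid_file
```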


Ahmed El-Gabbas