
This function prepares initial models in R for use with Hmsc-HPC. It covers preparing data, defining spatial block cross-validation folds, initializing models, generating Gaussian Predictive Process (GPP) knots, and creating commands for HPC execution. It supports parallel processing and can prepare models with or without phylogenetic tree data. The models will be fitted using a Gaussian Predictive Process (GPP; see Tikhonov et al. for more details) via the Hmsc-HPC extension.

Usage

Mod_Prep4HPC(
  Hab_Abb = NULL,
  Path_Model = NULL,
  MinEffortsSp = 100L,
  PresPerSpecies = 80L,
  EnvFile = ".env",
  GPP = TRUE,
  GPP_Dists = NULL,
  GPP_Save = TRUE,
  GPP_Plot = TRUE,
  MinLF = NULL,
  MaxLF = NULL,
  Alphapw = list(Prior = NULL, Min = 20, Max = 1200, Samples = 200),
  BioVars = c("bio3", "bio4", "bio11", "bio18", "bio19", "npp"),
  QuadraticVars = BioVars,
  EffortsAsPredictor = TRUE,
  RoadRailAsPredictor = TRUE,
  HabAsPredictor = TRUE,
  RiversAsPredictor = TRUE,
  NspPerGrid = 0L,
  ExcludeCult = TRUE,
  ExcludeZeroHabitat = TRUE,
  CV_NFolds = 4L,
  CV_NGrids = 20L,
  CV_NR = 2L,
  CV_NC = 2L,
  CV_Plot = TRUE,
  CV_SAC = FALSE,
  PhyloTree = TRUE,
  NoPhyloTree = FALSE,
  OverwriteRDS = TRUE,
  NCores = 8L,
  NChains = 4L,
  thin = NULL,
  samples = 1000L,
  transientFactor = 500L,
  verbose = 200L,
  SkipFitted = TRUE,
  NumArrayJobs = 210L,
  ModelCountry = NULL,
  VerboseProgress = TRUE,
  FromHPC = TRUE,
  PrepSLURM = TRUE,
  MemPerCpu = NULL,
  Time = NULL,
  JobName = NULL,
  Path_Hmsc = NULL,
  CheckPython = FALSE,
  ToJSON = FALSE,
  Precision = 64,
  ...
)

Arguments

Hab_Abb

Character. Abbreviation for the habitat type (based on SynHab) for which to prepare data. Valid values are 0, 1, 2, 3, 4a, 4b, 10, 12a, 12b. If Hab_Abb = 0, data are prepared irrespective of the habitat type. For more details, see Pysek et al.

Path_Model

String (without trailing slash) specifying the path where all output, including models to be fitted, will be saved.

MinEffortsSp

Integer specifying the minimum number of vascular plant species per grid cell (from GBIF data) required for inclusion in the models. This excludes grid cells with very little sampling effort. Defaults to 100.

PresPerSpecies

Integer. The minimum number of presence grid cells for a species to be included in the analysis. The number of presence grid cells per species is calculated after discarding grid cells with low sampling efforts (MinEffortsSp). Defaults to 80.

EnvFile

Character. Path to the environment file containing paths to data sources. Defaults to .env.

GPP

Logical indicating whether to fit the spatial random effect using a Gaussian Predictive Process. Defaults to TRUE. If FALSE, non-spatial models will be fitted.

GPP_Dists

Integer specifying the distance in kilometers used both for the spacing between knots and the minimum allowable distance between a knot and the nearest sampling point. The GPP knots are prepared by the PrepKnots function. The same value will be used for the knotDist and minKnotDist arguments of the Hmsc::constructKnots function.

GPP_Save

Logical indicating whether to save the resulting knots as an RData file. Default: TRUE.

GPP_Plot

Logical indicating whether to plot the coordinates of the sampling units and the knots in a PDF file. Default: TRUE.

MinLF, MaxLF

Integer. Minimum and maximum number of latent factors to be used. Both default to NULL, which means that the number of latent factors will be estimated from the data. If either is provided, the respective value will be passed as an argument to Hmsc::setPriors.

Alphapw

Prior for the alpha parameter. Defaults to a list with Prior = NULL, Min = 20, Max = 1200, and Samples = 200. If Alphapw is NULL or a list in which all items are NULL, the default prior will be used. If Prior is a matrix, it will be used as the prior. If Prior is NULL, the prior will be generated using Min, Max, and Samples. Min and Max are the minimum and maximum values of the alpha parameter (in kilometers). Samples is the number of samples to be used in the prior.
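As a rough sketch of how a prior could be generated from Min, Max, and Samples, the base R code below builds a two-column matrix of candidate alpha values and prior weights, which is the layout Hmsc uses for alphapw. The helper name make_alpha_prior, the equally spaced grid, and the extra zero row carrying half of the prior mass are assumptions made for illustration; the function's actual construction may differ.

```r
# Hypothetical helper, not part of the documented API.
make_alpha_prior <- function(Min = 20, Max = 1200, Samples = 200) {
  stopifnot(Min > 0, Max > Min, Samples >= 2)
  # Candidate spatial scale values (km): zero plus an equally spaced grid
  alpha <- c(0, seq(Min, Max, length.out = Samples))
  # Assumed weights: half of the mass on alpha = 0, the rest spread evenly
  weight <- c(0.5, rep(0.5 / Samples, Samples))
  cbind(alpha = alpha, weight = weight)
}

prior <- make_alpha_prior()
head(prior, 3)
```

A matrix of this shape could also be supplied directly as Prior, in which case Min, Max, and Samples would not be needed.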

BioVars

Character vector. Specifies variables from CHELSA to be used in the model. This can include bioclimatic variables (bio1-19) as well as other predictors such as npp (Net Primary Productivity). Defaults to 6 ecologically meaningful and less correlated variables: c("bio3", "bio4", "bio11", "bio18", "bio19", "npp").

QuadraticVars

Character vector of variables for which quadratic terms are used. Defaults to all variables in BioVars. If QuadraticVars is NULL, no quadratic terms will be used.

EffortsAsPredictor

Logical indicating whether to include the (log10) sampling efforts as a predictor in the model. Default: TRUE.

RoadRailAsPredictor

Logical indicating whether to include the (log10) sum of road and railway intensity as a predictor in the model. Default: TRUE.

HabAsPredictor

Logical indicating whether to include the (log10) percentage coverage of the respective habitat type per grid cell as a predictor in the model. Default: TRUE. Only valid if Hab_Abb is not equal to 0.

RiversAsPredictor

Logical indicating whether to include the total length of rivers per grid cell as a predictor in the model. Default: TRUE. See River_Length for more details.

NspPerGrid

Integer indicating the minimum number of species per grid cell for a grid cell to be included in the analysis. This is calculated after filtering grid cells by sampling efforts (MinEffortsSp) and filtering species by the number of presence grid cells (PresPerSpecies). If NspPerGrid = 0 (default), all grid cells will be used in the models. If NspPerGrid > 0, only grid cells with at least NspPerGrid species present will be considered in the models.
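The order of the three filters (MinEffortsSp, then PresPerSpecies, then NspPerGrid) can be sketched in base R on a toy presence-absence matrix. The helper filter_model_data and the toy data are hypothetical and only illustrate the sequence described here; they are not the function's internal code.

```r
# Hypothetical helper illustrating the filtering order.
# pa: presence-absence matrix (rows = grid cells, columns = species)
# efforts: number of vascular plant species recorded per grid cell
filter_model_data <- function(pa, efforts, MinEffortsSp = 100L,
                              PresPerSpecies = 80L, NspPerGrid = 0L) {
  stopifnot(nrow(pa) == length(efforts))
  # 1. Drop grid cells with low sampling effort
  pa <- pa[efforts >= MinEffortsSp, , drop = FALSE]
  # 2. Keep species with enough presence cells among the remaining cells
  pa <- pa[, colSums(pa) >= PresPerSpecies, drop = FALSE]
  # 3. Optionally drop grid cells with too few of the retained species
  if (NspPerGrid > 0L) {
    pa <- pa[rowSums(pa) >= NspPerGrid, , drop = FALSE]
  }
  pa
}

# Toy data: 6 grid cells, 3 species (thresholds lowered to suit the toy size)
pa <- cbind(
  sp1 = c(1, 1, 1, 0, 1, 0),
  sp2 = c(1, 0, 0, 0, 0, 1),
  sp3 = c(0, 1, 1, 0, 1, 1)
)
efforts <- c(150, 120, 90, 200, 300, 110)

kept <- filter_model_data(pa, efforts, MinEffortsSp = 100L,
                          PresPerSpecies = 3L, NspPerGrid = 1L)
dim(kept)
```

Because species counts are computed after the effort filter, a species can fall below PresPerSpecies even if it would pass on the unfiltered data.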

ExcludeCult

Logical. Indicates whether to exclude countries with cultivated or casual observations per species. Defaults to TRUE.

ExcludeZeroHabitat

Logical. Indicates whether to exclude grid cells with zero habitat coverage. Defaults to TRUE.

CV_NFolds

Number of cross-validation folds. Default: 4.

CV_NGrids

Integer. For the CV_Dist cross-validation strategy (see below), this argument determines the size of the blocks (the number of grid cells in both directions).

CV_NR, CV_NC

Integer. The number of rows and columns used in the CV_Large cross-validation strategy (see below), in which the study area is divided into large blocks according to the provided CV_NR and CV_NC values. Both default to 2, which splits the study area into four large blocks at the median latitude and longitude.
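To illustrate the CV_Large idea, the sketch below assigns points to CV_NR x CV_NC blocks using break points at equally spaced quantiles of the coordinates, which reduces to the median split for the default value of 2. The helper assign_large_blocks and the use of quantiles for values other than 2 are assumptions; the function may define the blocks differently.

```r
# Hypothetical helper: assign each point to one of CV_NR x CV_NC large blocks.
assign_large_blocks <- function(x, y, CV_NR = 2L, CV_NC = 2L) {
  # Break points at equally spaced quantiles (the median when the count is 2)
  col_breaks <- quantile(x, probs = seq(0, 1, length.out = CV_NC + 1))
  row_breaks <- quantile(y, probs = seq(0, 1, length.out = CV_NR + 1))
  col_id <- cut(x, breaks = col_breaks, include.lowest = TRUE, labels = FALSE)
  row_id <- cut(y, breaks = row_breaks, include.lowest = TRUE, labels = FALSE)
  # One block id per row/column combination, numbered row by row
  (row_id - 1L) * CV_NC + col_id
}

# Toy coordinates on a regular 4 x 4 grid
coords <- expand.grid(x = 1:4, y = 1:4)
fold <- assign_large_blocks(coords$x, coords$y)
table(fold)
```

With real data the blocks hold roughly equal numbers of grid cells but can differ greatly in geographic extent, since the breaks follow the distribution of the coordinates.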

CV_Plot

Logical indicating whether to plot the block cross-validation folds.

CV_SAC

Logical indicating whether to use spatial autocorrelation to determine the block size. Defaults to FALSE.

PhyloTree, NoPhyloTree

Logical parameters indicating whether to fit models with (PhyloTree) or without (NoPhyloTree) phylogenetic trees. Defaults are PhyloTree = TRUE and NoPhyloTree = FALSE, meaning only models with phylogenetic trees are fitted by default. At least one of PhyloTree and NoPhyloTree should be TRUE.
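The rule that at least one of the two flags must be TRUE can be expressed as a small check. The helper tree_variants and the labels "Tree" and "NoTree" are hypothetical and used here only to show which model variants each combination would produce.

```r
# Hypothetical helper: which model variants do the two flags request?
tree_variants <- function(PhyloTree = TRUE, NoPhyloTree = FALSE) {
  if (!PhyloTree && !NoPhyloTree) {
    stop("At least one of PhyloTree and NoPhyloTree must be TRUE",
         call. = FALSE)
  }
  c("Tree", "NoTree")[c(PhyloTree, NoPhyloTree)]
}

default_variants <- tree_variants()
both_variants <- tree_variants(PhyloTree = TRUE, NoPhyloTree = TRUE)
```

Setting both flags to TRUE doubles the number of models to fit, which matters when sizing the SLURM array.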

OverwriteRDS

Logical indicating whether to overwrite previously exported RDS files for initial models. Default: TRUE.

NCores

Integer specifying the number of cores used for parallel processing. Default: 8 cores.

NChains

Integer specifying the number of model chains. Default: 4.

thin

Integer specifying the value(s) for thinning in MCMC sampling. If more than one value is provided, a separate model will be fitted at each value of thinning.

samples

Integer specifying the value(s) for the number of MCMC samples. If more than one value is provided, a separate model will be fitted at each value of number of samples. Defaults to 1000.

transientFactor

Integer specifying the transient multiplication factor. The value of transient equals transientFactor multiplied by thin. Default: 500.
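As a worked example of how thin, samples, and transientFactor combine, the sketch below lists one row per model variant and derives transient as transientFactor * thin. The helper build_mcmc_grid is hypothetical, and crossing every thin value with every samples value is an assumption about how multiple values are combined.

```r
# Hypothetical helper: one row per combination of thin and samples.
build_mcmc_grid <- function(thin, samples = 1000L, transientFactor = 500L) {
  grid <- expand.grid(thin = thin, samples = samples,
                      KEEP.OUT.ATTRS = FALSE)
  # transient is the product of transientFactor and thin
  grid$transient <- transientFactor * grid$thin
  grid
}

mcmc_grid <- build_mcmc_grid(thin = c(5L, 10L), samples = c(1000L, 2000L))
mcmc_grid
```

With the default transientFactor of 500, a thinning value of 10 gives a transient of 5000 iterations.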

verbose

Integer indicating the interval at which MCMC sampling progress is reported. Default: 200.

SkipFitted

Logical indicating whether to skip already fitted models. Default: TRUE.

NumArrayJobs

Integer specifying the maximum number of array jobs per SLURM script. Default: 210. See LUMI documentation for more details.

ModelCountry

String or vector of strings specifying the country or countries to filter observations by. Default: NULL, which means that data are prepared for the whole of Europe.

VerboseProgress

Logical. Indicates whether progress messages should be displayed. Defaults to TRUE.

FromHPC

Logical indicating whether the work is being done from HPC, to adjust file paths accordingly. Default: TRUE.

PrepSLURM

Logical indicating whether to prepare SLURM command files. If TRUE (default), the SLURM commands will be saved to disk using the Mod_SLURM function.

MemPerCpu

String specifying the memory per CPU for the SLURM job. This value will be assigned to the #SBATCH --mem-per-cpu= SLURM argument. Example: "32G" to request 32 gigabytes. Only effective if PrepSLURM = TRUE.

Time

String specifying the requested time for each job in the SLURM bash arrays. Example: "01:00:00" to request an hour. Only effective if PrepSLURM = TRUE.

JobName

String specifying the name of the submitted job(s) for SLURM. If NULL (Default), the job name will be prepared based on the folder path and the Hab_Abb value. Only effective if PrepSLURM = TRUE.
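To show how MemPerCpu, Time, JobName, and NumArrayJobs might map onto a SLURM script header, the sketch below assembles a few #SBATCH lines. The helper slurm_header and its n_jobs argument are hypothetical; --job-name, --mem-per-cpu, --time, and --array are standard SLURM options, but the script actually written by Mod_SLURM may contain different or additional lines.

```r
# Hypothetical helper: build illustrative #SBATCH header lines.
slurm_header <- function(JobName, MemPerCpu, Time, n_jobs,
                         NumArrayJobs = 210L) {
  # Cap the array size at the maximum number of array jobs per script
  array_max <- min(n_jobs, NumArrayJobs)
  c(
    "#!/bin/bash",
    paste0("#SBATCH --job-name=", JobName),
    paste0("#SBATCH --mem-per-cpu=", MemPerCpu),
    paste0("#SBATCH --time=", Time),
    paste0("#SBATCH --array=1-", array_max)
  )
}

header <- slurm_header(JobName = "Hab1_models", MemPerCpu = "32G",
                       Time = "01:00:00", n_jobs = 500L)
cat(header, sep = "\n")
```

In this sketch, 500 model fitting commands with NumArrayJobs = 210 would need three scripts, each covering at most 210 array tasks.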

Path_Hmsc

String specifying the path for the Hmsc-HPC. This will be provided as the Path_Hmsc argument of the Mod_SLURM function.

CheckPython

Logical indicating whether to check if the Python executable exists. Only valid if FromHPC = FALSE.

ToJSON

Logical indicating whether to convert unfitted models to JSON before saving to RDS file. Default: FALSE.

Precision

Integer, either 32 (--fp 32) or 64 (default; --fp 64), giving the floating-point precision used for sampling while fitting Hmsc-HPC models. In Hmsc-HPC, the default value is also 64. This is still under testing.

...

Additional parameters provided to the Mod_SLURM function.

Value

The function is used for its side effects of preparing data and models for HPC and does not return any value.

Details

The function provides options for:

  • the habitat type for which the models will be fitted (Hab_Abb)

  • excluding grid cells with very low sampling efforts (MinEffortsSp)

  • selecting species based on a minimum number of presence grid cells (PresPerSpecies)

  • optionally fitting models for a specified list of countries (ModelCountry)

  • whether to exclude grid cells with few species (NspPerGrid)

  • number of cross-validation folds

  • whether to include phylogenetic information in the model

  • different values for knot distance for GPP (GPP_Dists)

  • which bioclimatic variables are used in the models (BioVars)

  • whether to include sampling efforts (EffortsAsPredictor), percentage of the respective habitat type per grid cell (HabAsPredictor), and railway and road intensity per grid cell (RoadRailAsPredictor)

  • Hmsc options (NChains, thin, samples, transientFactor, and verbose)

  • preparing SLURM commands (PrepSLURM) and some of their specifications (e.g. NumArrayJobs, MemPerCpu, Time, JobName)

    The function reads the following environment variables:

    • DP_R_Grid (if FromHPC = TRUE) or DP_R_Grid_Local (if FromHPC = FALSE). The function reads the content of the Grid_10_Land_Crop.RData file from this path.

    • DP_R_TaxaInfo or DP_R_TaxaInfo_Local for the location of the Species_List_ID.txt file representing species information.

    • DP_R_EUBound_sf or DP_R_EUBound_sf_Local for the path of the RData file containing the country boundaries (sf object)

    • DP_R_PA or DP_R_PA_Local: The function reads the contents of the Sp_PA_Summary_DF.RData file from this path
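The choice between an HPC variable and its _Local counterpart can be sketched as follows. The helper env_var_name, the placeholder paths, and the use of base R's readRenviron to load the environment file are assumptions for illustration; the function may read EnvFile by other means.

```r
# Hypothetical helper: pick the variable name depending on FromHPC.
env_var_name <- function(base, FromHPC = TRUE) {
  if (FromHPC) base else paste0(base, "_Local")
}

# Write a toy environment file with placeholder paths, then load it
env_file <- tempfile(fileext = ".env")
writeLines(
  c("DP_R_Grid=/hpc/placeholder/grid",
    "DP_R_Grid_Local=/local/placeholder/grid"),
  env_file
)
readRenviron(env_file)

# Resolve the grid directory for a local (non-HPC) session
path_grid <- Sys.getenv(env_var_name("DP_R_Grid", FromHPC = FALSE))
grid_file <- file.path(path_grid, "Grid_10_Land_Crop.RData")
grid_file
```

Keeping both variants in the same environment file lets the same call work on a local machine and on the HPC by changing only FromHPC.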

Author

Ahmed El-Gabbas