Prepare initial models for model fitting with Hmsc-HPC

The mod_prepare_HPC function prepares input data and initialises models for fitting with Hmsc-HPC. It performs multiple tasks, including data preparation, defining spatial block cross-validation folds, generating Gaussian Predictive Process (GPP) knots (Tikhonov et al.), initialising models, and creating HPC execution commands. The function supports parallel processing and offers the option to include or exclude phylogenetic tree data.

The mod_prepare_data function is used to prepare habitat-specific data for Hmsc models. This function processes environmental and species presence data, reads environment variables from a file, verifies paths, loads and filters species data based on habitat type and minimum presence grid cells per species, and merges various environmental layers (e.g., CHELSA Bioclimatic variables, habitat coverage, road and railway intensity, sampling efforts) into a single dataset. Processed data is saved to disk as an *.RData file.
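The filtering order described above can be sketched on toy data. This is only an illustration in base R, not the function's actual implementation: the matrix `pa`, the vectors `efforts_n_species` and `habitat_cover`, and the scaled-down thresholds are all made up for the example. Only the threshold names come from the arguments documented here.

```r
# Toy presence-absence matrix: 6 grid cells (rows) x 3 species (columns).
# All values are invented for illustration.
pa <- matrix(
  c(1, 1, 0,
    1, 0, 0,
    1, 1, 1,
    0, 1, 0,
    1, 1, 0,
    1, 0, 0),
  ncol = 3, byrow = TRUE,
  dimnames = list(paste0("cell", 1:6), c("sp_a", "sp_b", "sp_c"))
)
efforts_n_species <- c(150, 40, 300, 120, 110, 200) # species recorded per cell
habitat_cover <- c(12, 5, 0, 30, 8, 2)              # % habitat cover per cell

# Thresholds, scaled down to suit the toy data (assumed values)
min_efforts_n_species <- 100
n_pres_per_species <- 2
n_species_per_grid <- 2

# 1. Drop cells with low sampling effort or zero habitat coverage
keep_cells <- efforts_n_species >= min_efforts_n_species & habitat_cover > 0
pa <- pa[keep_cells, , drop = FALSE]

# 2. Keep species with enough presence cells among the remaining cells
keep_species <- colSums(pa) >= n_pres_per_species
pa <- pa[, keep_species, drop = FALSE]

# 3. Keep cells holding at least n_species_per_grid of the retained species
pa <- pa[rowSums(pa) >= n_species_per_grid, , drop = FALSE]

print(pa)
```

Because species are counted only after the low-effort and zero-habitat cells are removed, a species that is common in poorly sampled cells can still fall below n_pres_per_species.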

Usage

mod_prepare_HPC(
  hab_abb = NULL,
  directory_name = NULL,
  min_efforts_n_species = 100L,
  n_pres_per_species = 80L,
  env_file = ".env",
  GPP = TRUE,
  GPP_dists = NULL,
  GPP_save = TRUE,
  GPP_plot = TRUE,
  min_LF = NULL,
  max_LF = NULL,
  alphapw = list(Prior = NULL, Min = 20, Max = 1200, Samples = 200),
  bio_variables = c("bio3", "bio4", "bio11", "bio18", "bio19", "npp"),
  quadratic_variables = bio_variables,
  efforts_as_predictor = TRUE,
  road_rail_as_predictor = TRUE,
  habitat_as_predictor = TRUE,
  river_as_predictor = TRUE,
  n_species_per_grid = 0L,
  exclude_cultivated = TRUE,
  exclude_0_habitat = TRUE,
  CV_n_folds = 4L,
  CV_n_grids = 20L,
  CV_n_rows = 2L,
  CV_n_columns = 2L,
  CV_plot = TRUE,
  CV_SAC = FALSE,
  use_phylo_tree = TRUE,
  no_phylo_tree = FALSE,
  overwrite_rds = TRUE,
  n_cores = 8L,
  strategy = "multisession",
  MCMC_n_chains = 4L,
  MCMC_thin = NULL,
  MCMC_samples = 1000L,
  MCMC_transient_factor = 500L,
  MCMC_verbose = 200L,
  skip_fitted = TRUE,
  n_array_jobs = 210L,
  model_country = NULL,
  verbose_progress = TRUE,
  SLURM_prepare = TRUE,
  memory_per_cpu = "64G",
  job_runtime = NULL,
  job_name = NULL,
  path_Hmsc = NULL,
  check_python = FALSE,
  to_JSON = FALSE,
  precision = 64L,
  ...
)

mod_prepare_data(
  hab_abb = NULL,
  directory_name = NULL,
  min_efforts_n_species = 100L,
  exclude_cultivated = TRUE,
  exclude_0_habitat = TRUE,
  n_pres_per_species = 80L,
  env_file = ".env",
  verbose_progress = TRUE
)
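A call for a small trial run might look like the sketch below. The argument names are taken from the usage above, but the values (habitat "1", the folder name `hab1_trial`, a 40 km knot spacing, and the MCMC settings) are illustrative assumptions, and the helper list `args_HPC` is hypothetical. Whether this exact combination passes the function's argument checks has not been verified here. The call is guarded so that it runs only when the function is loaded and the environment file exists.

```r
# Arguments for a small trial run; values are illustrative, not recommendations
args_HPC <- list(
  hab_abb = "1",
  directory_name = "hab1_trial", # hypothetical folder name
  env_file = ".env",
  GPP_dists = 40L,               # assumed knot spacing in km
  MCMC_thin = 5L,
  MCMC_samples = 1000L,
  n_cores = 4L,
  SLURM_prepare = FALSE          # skip writing SLURM command files
)

# Run only when the function is available and the environment file exists
if (exists("mod_prepare_HPC", mode = "function") &&
    file.exists(args_HPC$env_file)) {
  do.call(mod_prepare_HPC, args_HPC)
}
```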

Arguments

hab_abb: Character. Abbreviation for the habitat type (based on SynHab) for which to prepare data. Valid values are 0, 1, 2, 3, 4a, 4b, 10, 12a, 12b. If hab_abb = 0, data is prepared irrespective of the habitat type. For more details, see Pysek et al.
directory_name: Character. Directory name, without its parents, where the models will be saved. This directory will be created.
min_efforts_n_species: Integer. Minimum number of vascular plant species per grid cell (from GBIF data) required for inclusion in the models. This excludes grid cells with very low sampling effort. Defaults to 100.
n_pres_per_species: Integer. The minimum number of presence grid cells for a species to be included in the analysis. The number of presence grid cells per species is calculated after discarding grid cells with low sampling efforts (min_efforts_n_species) and zero percentage habitat coverage (exclude_0_habitat). Defaults to 80.
env_file: Character. Path to the environment file containing paths to data sources. Defaults to .env.
GPP: Logical. Whether to fit spatial random effect using Gaussian Predictive Process. Defaults to TRUE. If FALSE, non-spatial models will be fitted.
GPP_dists: Integer. Spacing (in kilometres) between GPP knots, as well as the minimum allowable distance between a knot and the nearest sampling point. The knots are generated using the prepare_knots function, and this value is used for both knotDist and minKnotDist in Hmsc::constructKnots.
GPP_save: Logical. Whether to save the resulting knots as an RData file. Default: TRUE.
GPP_plot: Logical. Whether to plot the coordinates of the sampling units and the knots in a pdf file. Default: TRUE.
min_LF, max_LF: Integer. Minimum and maximum number of latent factors to be used. Both default to NULL, which means that the number of latent factors will be estimated from the data. If either is provided, the respective values will be used as arguments to Hmsc::setPriors.
alphapw: Prior for the alpha parameter. Defaults to a list with Prior = NULL, Min = 20, Max = 1200, and Samples = 200. If alphapw is NULL or a list with all NULL list items, the default prior will be used. If Prior is a matrix, it will be used as the prior. If Prior = NULL, the prior will be generated using Min, Max, and Samples. Min and Max are the minimum and maximum values of the alpha parameter (in kilometres). Samples is the number of samples to be used in the prior.
bio_variables: Character vector. Variables from CHELSA (bioclimatic variables (bio1-bio19) and additional predictors (e.g., Net Primary Productivity, npp)) to be used in the model. By default, six ecologically relevant and minimally correlated variables are selected: c("bio3", "bio4", "bio11", "bio18", "bio19", "npp").
quadratic_variables: Character vector. Variables for which quadratic terms are used. Defaults to all variables in bio_variables. If quadratic_variables is NULL, no quadratic terms will be used.
efforts_as_predictor: Logical. Whether to include the (log₁₀) sampling efforts as a predictor in the model. Default: TRUE.
road_rail_as_predictor: Logical. Whether to include the (log₁₀) sum of road and railway intensity as a predictor in the model. Default: TRUE.
habitat_as_predictor: Logical. Whether to include the (log₁₀) percentage coverage of the respective habitat type per grid cell as a predictor in the model. Default: TRUE. Only valid if hab_abb is not 0.
river_as_predictor: Logical. Whether to include the (log₁₀) total length of rivers per grid cell as a predictor in the model. Default: TRUE. See river_length for more details.
n_species_per_grid: Integer. Minimum number of species required for a grid cell to be included in the analysis. This filtering occurs after applying min_efforts_n_species (sampling effort thresholds), n_pres_per_species (minimum species presence thresholds), and exclude_0_habitat (exclude 0% habitat coverage). Default (0): Includes all grid cells. Positive value (>0): Includes only grid cells where at least n_species_per_grid species are present.
exclude_cultivated: Logical. Whether to exclude countries with cultivated or casual observations per species. Defaults to TRUE.
exclude_0_habitat: Logical. Whether to exclude grid cells with zero percentage habitat coverage. Defaults to TRUE.
CV_n_folds: Integer. Number of cross-validation folds. Default: 4L.
CV_n_grids: Integer. For the CV_Dist cross-validation strategy (see mod_CV_prepare), this argument determines the size of the blocks (the number of grid cells in each direction). Default: 20.
CV_n_rows, CV_n_columns: Integer. Number of rows and columns used in the CV_Large cross-validation strategy (see mod_CV_prepare), in which the study area is divided into large blocks given the provided CV_n_rows and CV_n_columns values. Both default to 2, which splits the study area into four large blocks at the median latitude and longitude.
CV_plot: Logical. Whether to plot the block cross-validation folds. Default: TRUE.
CV_SAC: Logical. Whether to use the spatial autocorrelation to determine the block size. Defaults to FALSE.
use_phylo_tree, no_phylo_tree: Logical. Whether to fit models with (use_phylo_tree) or without (no_phylo_tree) phylogenetic trees. Defaults are use_phylo_tree = TRUE and no_phylo_tree = FALSE, meaning only models with phylogenetic trees are fitted by default. At least one of use_phylo_tree and no_phylo_tree should be TRUE.
overwrite_rds: Logical. Whether to overwrite previously exported RDS files for initial models. Default: TRUE.
n_cores: Integer. Number of CPU cores to use for parallel processing. Default: 8.
strategy: Character. The parallel processing strategy to use. Valid options are "sequential", "multisession" (default), "multicore", and "cluster". See future::plan() and ecokit::set_parallel() for details.
MCMC_n_chains: Integer. Number of model chains. Default: 4.
MCMC_thin: Integer vector. Thinning value(s) in MCMC sampling. If more than one value is provided, a separate model will be fitted at each value of thinning.
MCMC_samples: Integer vector. Value(s) for the number of MCMC samples. If more than one value is provided, a separate model will be fitted at each value of number of samples. Defaults to 1000.
MCMC_transient_factor: Integer. Transient multiplication factor. The transient value equals the product of MCMC_transient_factor and MCMC_thin. Default: 500.
MCMC_verbose: Integer. Interval at which MCMC sampling progress is reported. Default: 200.
skip_fitted: Logical. Whether to skip already fitted models. Default: TRUE.
n_array_jobs: Integer. Number of jobs per SLURM script file. In LUMI HPC, there is a limit of 210 submitted jobs per user for the small-g partition. This argument is used to split the jobs into multiple SLURM scripts if needed. Default: 210. See LUMI documentation for more details.
model_country: Character. Country or countries to filter observations by. Default: NULL, which means data is prepared for the whole of Europe.
verbose_progress: Logical. Whether to print a message upon successful saving of files. Defaults to TRUE.
SLURM_prepare: Logical. Whether to prepare SLURM command files. If TRUE (default), the SLURM commands will be saved to disk using the mod_SLURM function.
memory_per_cpu: Character. Memory per CPU for the SLURM job. This value will be assigned to the #SBATCH --mem-per-cpu= SLURM argument. Example: "32G" to request 32 gigabytes. Only effective if SLURM_prepare = TRUE. Defaults to "64G".
job_runtime: Character. Requested time for each job in the SLURM bash arrays. Example: "01:00:00" to request an hour. Only effective if SLURM_prepare = TRUE.
job_name: Character. Name of the submitted job(s) for SLURM. If NULL (Default), the job name will be prepared based on the folder path and the hab_abb value. Only effective if SLURM_prepare = TRUE.
path_Hmsc: Character. Directory path to Hmsc-HPC extension installation. This will be provided as the path_Hmsc argument of the mod_SLURM function.
check_python: Logical. Whether to check if the Python executable exists.
to_JSON: Logical. Whether to convert unfitted models to JSON before saving to RDS file. Default: FALSE.
precision: Integer (either 32 or 64). Defines the floating-point precision mode for Hmsc-HPC sampling (--fp 32 or --fp 64). The default is 64, which is the default precision in Hmsc-HPC.
...: Additional parameters provided to the mod_SLURM function.
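
The interaction between MCMC_thin, MCMC_samples, MCMC_transient_factor, and n_array_jobs can be sketched in base R. This is a rough illustration under stated assumptions, not the package's internal code: the vectors of thinning and sample values are example inputs, the total of 500 jobs is hypothetical, and chunking jobs with `ceiling()` is just one simple way to respect a per-script limit.

```r
# Example inputs (assumed values; only the argument names come from above)
MCMC_thin <- c(5L, 10L)
MCMC_samples <- c(1000L, 2000L)
MCMC_transient_factor <- 500L
n_array_jobs <- 210L

# One model variant per combination of thinning and number of samples
variants <- expand.grid(thin = MCMC_thin, samples = MCMC_samples)

# Transient = MCMC_transient_factor * MCMC_thin, as documented above
variants$transient <- MCMC_transient_factor * variants$thin
print(variants)

# Splitting a hypothetical total of 500 jobs into SLURM scripts that each
# hold at most n_array_jobs jobs
n_jobs <- 500L
script_id <- ceiling(seq_len(n_jobs) / n_array_jobs)
jobs_per_script <- tabulate(script_id)
print(jobs_per_script)
```

With two thinning values and two sample sizes, four model variants are defined. The transient length depends only on the thinning value, so variants sharing a thinning value share a transient length.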

Author

Ahmed El-Gabbas