Prepare initial models for model fitting with Hmsc-HPC
Source: R/mod_prepare_HPC.R, R/mod_prepare_data.R
Mod_inputs.Rd
The mod_prepare_HPC function prepares input data and initialises models for fitting with Hmsc-HPC. It performs multiple tasks, including data preparation, defining spatial block cross-validation folds, generating Gaussian Predictive Process (GPP) knots (Tikhonov et al.), initialising models, and creating HPC execution commands. The function supports parallel processing and offers the option to include or exclude phylogenetic tree data.
The mod_prepare_data function prepares habitat-specific data for Hmsc models. It processes environmental and species presence data: it reads environment variables from a file, verifies paths, loads and filters species data based on habitat type and the minimum number of presence grid cells per species, and merges various environmental layers (e.g., CHELSA bioclimatic variables, habitat coverage, road and railway intensity, sampling efforts) into a single dataset. Processed data is saved to disk as an *.RData file.
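The filtering order described above (first drop grid cells, then count presences per species) can be sketched in base R. This is only an illustration on toy data, not the package's implementation: the objects cells and pres, their column names, and the inclusive (>=) comparisons are assumptions made for the example, and the species threshold is lowered from the documented default of 80 so that it is meaningful for six grid cells.

```r
# Toy data: 6 grid cells and 3 species (all values are hypothetical)
cells <- data.frame(
  cell_id = 1:6,
  efforts_n_species = c(150, 40, 300, 120, 500, 90), # species recorded per cell
  habitat_cover = c(10, 25, 0, 5, 60, 30)            # % cover of the habitat type
)
# Presence/absence matrix; matrix() fills column by column, one species each
pres <- matrix(
  c(1, 1, 1, 1, 1, 0,  # sp_A
    0, 1, 1, 0, 1, 1,  # sp_B
    1, 0, 0, 1, 1, 1), # sp_C
  nrow = 6, dimnames = list(NULL, c("sp_A", "sp_B", "sp_C"))
)

min_efforts_n_species <- 100L
n_pres_per_species <- 2L  # toy threshold; the documented default is 80L
exclude_0_habitat <- TRUE

# Step 1: discard grid cells with low sampling effort (assumed inclusive >=)
keep_cells <- cells$efforts_n_species >= min_efforts_n_species
# Step 2: optionally discard grid cells with zero habitat coverage
if (exclude_0_habitat) keep_cells <- keep_cells & cells$habitat_cover > 0

# Step 3: count presences per species in the remaining grid cells only
n_pres <- colSums(pres[keep_cells, , drop = FALSE])
keep_species <- names(n_pres)[n_pres >= n_pres_per_species]

print(cells$cell_id[keep_cells]) # grid cells that pass both filters
print(keep_species)              # species that pass the presence threshold
```

Because presences are counted after the grid-cell filters, a species that is widespread in poorly sampled cells (sp_B here) can still be dropped.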
Usage
mod_prepare_HPC(
  hab_abb = NULL,
  directory_name = NULL,
  min_efforts_n_species = 100L,
  n_pres_per_species = 80L,
  env_file = ".env",
  GPP = TRUE,
  GPP_dists = NULL,
  GPP_save = TRUE,
  GPP_plot = TRUE,
  min_LF = NULL,
  max_LF = NULL,
  alphapw = list(Prior = NULL, Min = 20, Max = 1200, Samples = 200),
  bio_variables = c("bio3", "bio4", "bio11", "bio18", "bio19", "npp"),
  quadratic_variables = bio_variables,
  efforts_as_predictor = TRUE,
  road_rail_as_predictor = TRUE,
  habitat_as_predictor = TRUE,
  river_as_predictor = TRUE,
  n_species_per_grid = 0L,
  exclude_cultivated = TRUE,
  exclude_0_habitat = TRUE,
  CV_n_folds = 4L,
  CV_n_grids = 20L,
  CV_n_rows = 2L,
  CV_n_columns = 2L,
  CV_plot = TRUE,
  CV_SAC = FALSE,
  use_phylo_tree = TRUE,
  no_phylo_tree = FALSE,
  overwrite_rds = TRUE,
  n_cores = 8L,
  strategy = "multisession",
  MCMC_n_chains = 4L,
  MCMC_thin = NULL,
  MCMC_samples = 1000L,
  MCMC_transient_factor = 500L,
  MCMC_verbose = 200L,
  skip_fitted = TRUE,
  n_array_jobs = 210L,
  model_country = NULL,
  verbose_progress = TRUE,
  SLURM_prepare = TRUE,
  memory_per_cpu = "64G",
  job_runtime = NULL,
  job_name = NULL,
  path_Hmsc = NULL,
  check_python = FALSE,
  to_JSON = FALSE,
  precision = 64L,
  ...
)
mod_prepare_data(
  hab_abb = NULL,
  directory_name = NULL,
  min_efforts_n_species = 100L,
  exclude_cultivated = TRUE,
  exclude_0_habitat = TRUE,
  n_pres_per_species = 80L,
  env_file = ".env",
  verbose_progress = TRUE
)
Arguments
- hab_abb
Character. Abbreviation for the habitat type (based on SynHab) for which to prepare data. Valid values are 0, 1, 2, 3, 4a, 4b, 10, 12a, 12b. If hab_abb = 0, data is prepared irrespective of the habitat type. For more details, see Pysek et al.
- directory_name
Character. Directory name, without its parents, where the models will be saved. This directory will be created.
- min_efforts_n_species
Integer. Minimum number of vascular plant species per grid cell (from GBIF data) required for inclusion in the models. This excludes grid cells with very little sampling effort. Defaults to 100.
- n_pres_per_species
Integer. The minimum number of presence grid cells for a species to be included in the analysis. The number of presence grid cells per species is calculated after discarding grid cells with low sampling efforts (min_efforts_n_species) and zero percentage habitat coverage (exclude_0_habitat). Defaults to 80.
- env_file
Character. Path to the environment file containing paths to data sources. Defaults to .env.
- GPP
Logical. Whether to fit a spatial random effect using a Gaussian Predictive Process. Defaults to TRUE. If FALSE, non-spatial models will be fitted.
- GPP_dists
Integer. Spacing (in kilometres) between GPP knots, as well as the minimum allowable distance between a knot and the nearest sampling point. The knots are generated using the prepare_knots function, and this value is used for both knotDist and minKnotDist in Hmsc::constructKnots.
- GPP_save
Logical. Whether to save the GPP knots to an RData file. Default: TRUE.
- GPP_plot
Logical. Whether to plot the coordinates of the sampling units and the knots in a PDF file. Default: TRUE.
- min_LF, max_LF
Integer. Minimum and maximum number of latent factors to be used. Both default to NULL, which means that the number of latent factors will be estimated from the data. If either is provided, the respective values will be used as arguments to Hmsc::setPriors.
- alphapw
Prior for the alpha parameter. Defaults to a list with Prior = NULL, Min = 20, Max = 1200, and Samples = 200. If alphapw is NULL or a list with all NULL list items, the default prior will be used. If Prior is a matrix, it will be used as the prior. If Prior = NULL, the prior will be generated using Min, Max, and Samples. Min and Max are the minimum and maximum values of the alpha parameter (in kilometres). Samples is the number of samples to be used in the prior.
- bio_variables
Character vector. Variables from CHELSA (bioclimatic variables (bio1-bio19) and additional predictors (e.g., Net Primary Productivity, npp)) to be used in the model. By default, six ecologically relevant and minimally correlated variables are selected: c("bio3", "bio4", "bio11", "bio18", "bio19", "npp").
- quadratic_variables
Character vector. Variables for which quadratic terms are used. Defaults to all variables in bio_variables. If quadratic_variables is NULL, no quadratic terms will be used.
- efforts_as_predictor
Logical. Whether to include the (log10) sampling efforts as a predictor in the model. Default: TRUE.
- road_rail_as_predictor
Logical. Whether to include the (log10) sum of road and railway intensity as a predictor in the model. Default: TRUE.
- habitat_as_predictor
Logical. Whether to include the (log10) percentage coverage of the respective habitat type per grid cell as a predictor in the model. Default: TRUE. Only valid if hab_abb is not equal to 0.
- river_as_predictor
Logical. Whether to include the (log10) total length of rivers per grid cell as a predictor in the model. Default: TRUE. See river_length for more details.
- n_species_per_grid
Integer. Minimum number of species required for a grid cell to be included in the analysis. This filtering occurs after applying min_efforts_n_species (sampling effort thresholds), n_pres_per_species (minimum species presence thresholds), and exclude_0_habitat (exclude 0% habitat coverage). Default (0): includes all grid cells. Positive value (>0): includes only grid cells where at least n_species_per_grid species are present.
- exclude_cultivated
Logical. Whether to exclude countries with cultivated or casual observations per species. Defaults to TRUE.
- exclude_0_habitat
Logical. Whether to exclude grid cells with zero percentage habitat coverage. Defaults to TRUE.
- CV_n_folds
Integer. Number of cross-validation folds. Default: 4L.
- CV_n_grids
Integer. For the CV_Dist cross-validation strategy (see mod_CV_prepare), this argument determines the size of the blocks (how many grid cells in both directions).
- CV_n_rows, CV_n_columns
Integer. Number of rows and columns used in the CV_Large cross-validation strategy (see mod_CV_prepare), in which the study area is divided into large blocks given the provided CV_n_rows and CV_n_columns values. Both default to 2, which splits the study area into four large blocks at the median latitude and longitude.
- CV_plot
Logical. Indicating whether to plot the block cross-validation folds.
- CV_SAC
Logical. Whether to use the spatial autocorrelation to determine the block size. Defaults to FALSE.
- use_phylo_tree, no_phylo_tree
Logical. Whether to fit models with (use_phylo_tree) or without (no_phylo_tree) phylogenetic trees. Defaults are use_phylo_tree = TRUE and no_phylo_tree = FALSE, meaning only models with phylogenetic trees are fitted by default. At least one of use_phylo_tree and no_phylo_tree should be TRUE.
- overwrite_rds
Logical. Whether to overwrite previously exported RDS files for initial models. Default: TRUE.
- n_cores
Integer. Number of CPU cores to use for parallel processing. Default: 8.
- strategy
Character. The parallel processing strategy to use. Valid options are "sequential", "multisession" (default), "multicore", and "cluster". See future::plan() and ecokit::set_parallel() for details.
- MCMC_n_chains
Integer. Number of model chains. Default: 4.
- MCMC_thin
Integer vector. Thinning value(s) in MCMC sampling. If more than one value is provided, a separate model will be fitted at each value of thinning.
- MCMC_samples
Integer vector. Value(s) for the number of MCMC samples. If more than one value is provided, a separate model will be fitted at each value of number of samples. Defaults to 1000.
- MCMC_transient_factor
Integer. Transient multiplication factor. The value of transient will equal the product of MCMC_transient_factor and MCMC_thin. Default: 500.
- MCMC_verbose
Integer. Interval at which MCMC sampling progress is reported. Default: 200.
- skip_fitted
Logical. Whether to skip already fitted models. Default: TRUE.
- n_array_jobs
Integer. Number of jobs per SLURM script file. On LUMI HPC, there is a limit of 210 submitted jobs per user for the small-g partition. This argument is used to split the jobs into multiple SLURM scripts if needed. Default: 210. See LUMI documentation for more details.
- model_country
Character. Country or countries to filter observations by. Default: NULL, which means data is prepared for the whole of Europe.
- verbose_progress
Logical. Whether to print a message upon successful saving of files. Defaults to TRUE.
- SLURM_prepare
Logical. Whether to prepare SLURM command files. If TRUE (default), the SLURM commands will be saved to disk using the mod_SLURM function.
- memory_per_cpu
Character. Memory per CPU for the SLURM job. This value will be assigned to the #SBATCH --mem-per-cpu= SLURM argument. Example: "32G" to request 32 gigabytes. Only effective if SLURM_prepare = TRUE. Defaults to "64G".
- job_runtime
Character. Requested time for each job in the SLURM bash arrays. Example: "01:00:00" to request an hour. Only effective if SLURM_prepare = TRUE.
- job_name
Character. Name of the submitted job(s) for SLURM. If NULL (default), the job name will be prepared based on the folder path and the hab_abb value. Only effective if SLURM_prepare = TRUE.
- path_Hmsc
Character. Directory path to the Hmsc-HPC extension installation. This will be provided as the path_Hmsc argument of the mod_SLURM function.
- check_python
Logical. Whether to check if the Python executable exists.
- to_JSON
Logical. Whether to convert unfitted models to JSON before saving to RDS file. Default: FALSE.
- precision
Integer (either 32 or 64). Defines the floating-point precision mode for Hmsc-HPC sampling (--fp 32 or --fp 64). The default is 64, which is the default precision in Hmsc-HPC.
- ...
Additional parameters provided to the mod_SLURM function.
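
Two of the arguments above involve simple arithmetic that is easy to check by hand: the transient value (MCMC_transient_factor multiplied by MCMC_thin) and the splitting of jobs across SLURM scripts (n_array_jobs). The base R sketch below illustrates both. It is not the package's code: the thinning values, the total number of fitting commands (n_commands), and the chunking approach are assumptions chosen for the example.

```r
# Transient per model: product of the factor and each thinning value
MCMC_thin <- c(5L, 10L)  # hypothetical thinning values (documented default: NULL)
MCMC_transient_factor <- 500L
transient <- MCMC_transient_factor * MCMC_thin
print(transient)  # 2500 5000

# Splitting fitting commands into SLURM scripts of at most n_array_jobs each
n_commands <- 500L    # hypothetical total number of model-fitting commands
n_array_jobs <- 210L  # documented default, matching the LUMI small-g limit
script_id <- ceiling(seq_len(n_commands) / n_array_jobs)  # script index per command
jobs_per_script <- as.integer(table(script_id))
print(jobs_per_script)  # 210 210 80
```

With these assumed numbers, 500 commands would need three script files, the last one holding the remaining 80 jobs.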