Prepare initial models in R for model fitting with Hmsc-HPC
Source: R/Mod_Prep4HPC.R
Mod_Prep4HPC.Rd
This function prepares initial models in R for use with Hmsc-HPC. It covers data preparation, definition of spatial block cross-validation folds, model initialization, generation of Gaussian Predictive Process (GPP) knots, and creation of commands for HPC execution. It supports parallel processing and can prepare models with or without phylogenetic tree data. The models will be fitted using the Gaussian Predictive Process (GPP; see Tikhonov et al. for more details) via the Hmsc-HPC extension.
Usage
Mod_Prep4HPC(
  Hab_Abb = NULL,
  Path_Model = NULL,
  MinEffortsSp = 100L,
  PresPerSpecies = 80L,
  EnvFile = ".env",
  GPP = TRUE,
  GPP_Dists = NULL,
  GPP_Save = TRUE,
  GPP_Plot = TRUE,
  MinLF = NULL,
  MaxLF = NULL,
  Alphapw = list(Prior = NULL, Min = 20, Max = 1200, Samples = 200),
  BioVars = c("bio3", "bio4", "bio11", "bio18", "bio19", "npp"),
  QuadraticVars = BioVars,
  EffortsAsPredictor = TRUE,
  RoadRailAsPredictor = TRUE,
  HabAsPredictor = TRUE,
  RiversAsPredictor = TRUE,
  NspPerGrid = 0L,
  ExcludeCult = TRUE,
  ExcludeZeroHabitat = TRUE,
  CV_NFolds = 4L,
  CV_NGrids = 20L,
  CV_NR = 2L,
  CV_NC = 2L,
  CV_Plot = TRUE,
  CV_SAC = FALSE,
  PhyloTree = TRUE,
  NoPhyloTree = FALSE,
  OverwriteRDS = TRUE,
  NCores = 8L,
  NChains = 4L,
  thin = NULL,
  samples = 1000L,
  transientFactor = 500L,
  verbose = 200L,
  SkipFitted = TRUE,
  NumArrayJobs = 210L,
  ModelCountry = NULL,
  VerboseProgress = TRUE,
  FromHPC = TRUE,
  PrepSLURM = TRUE,
  MemPerCpu = NULL,
  Time = NULL,
  JobName = NULL,
  Path_Hmsc = NULL,
  CheckPython = FALSE,
  ToJSON = FALSE,
  Precision = 64,
  ...
)
Arguments
- Hab_Abb
Character. Abbreviation for the habitat type (based on SynHab) for which to prepare data. Valid values are 0, 1, 2, 3, 4a, 4b, 10, 12a, and 12b. If Hab_Abb = 0, data is prepared irrespective of the habitat type. For more details, see Pysek et al.
- Path_Model
String (without trailing slash) specifying the path where all output, including models to be fitted, will be saved.
- MinEffortsSp
Integer specifying the minimum number of vascular plant species per grid cell (from GBIF data) required for inclusion in the models. This excludes grid cells with very low sampling effort. Defaults to 100.
- PresPerSpecies
Integer. The minimum number of presence grid cells for a species to be included in the analysis. The number of presence grid cells per species is calculated after discarding grid cells with low sampling effort (MinEffortsSp). Defaults to 80.
- EnvFile
Character. Path to the environment file containing paths to data sources. Defaults to .env.
- GPP
Logical indicating whether to fit the spatial random effect using a Gaussian Predictive Process. Defaults to TRUE. If FALSE, non-spatial models will be fitted.
- GPP_Dists
Integer specifying the distance in kilometers used both for the spacing between knots and for the minimum allowable distance between a knot and the nearest sampling point. The GPP knots are prepared by the PrepKnots function. The same value will be used for the knotDist and minKnotDist arguments of the Hmsc::constructKnots function.
- GPP_Save
Logical indicating whether to save the resulting knots as an RData file. Default: TRUE.
- GPP_Plot
Logical indicating whether to plot the coordinates of the sampling units and the knots in a PDF file. Default: TRUE.
- MinLF, MaxLF
Integer. Minimum and maximum number of latent factors to be used. Both default to NULL, which means that the number of latent factors will be estimated from the data. If either is provided, the respective value will be used as an argument to Hmsc::setPriors.
- Alphapw
Prior for the alpha parameter. Defaults to a list with Prior = NULL, Min = 20, Max = 1200, and Samples = 200. If Alphapw is NULL or a list in which all items are NULL, the default prior will be used. If Prior is a matrix, it will be used as the prior. If Prior is NULL, the prior will be generated using Min, Max, and Samples. Min and Max are the minimum and maximum values of the alpha parameter (in kilometers). Samples is the number of samples to be used in the prior.
- BioVars
Character vector. Specifies variables from CHELSA to be used in the model. This can include bioclimatic variables (bio1-19) as well as other predictors such as npp (Net Primary Productivity). Defaults to six ecologically meaningful and less correlated variables: c("bio3", "bio4", "bio11", "bio18", "bio19", "npp").
- QuadraticVars
Character vector of variables for which quadratic terms are used. Defaults to all variables in BioVars. If QuadraticVars is NULL, no quadratic terms will be used.
- EffortsAsPredictor
Logical indicating whether to include the (log10) sampling effort as a predictor in the model. Default: TRUE.
- RoadRailAsPredictor
Logical indicating whether to include the (log10) sum of road and railway intensity as a predictor in the model. Default: TRUE.
- HabAsPredictor
Logical indicating whether to include the (log10) percentage coverage of the respective habitat type per grid cell as a predictor in the model. Default: TRUE. Only valid if Hab_Abb is not equal to 0.
- RiversAsPredictor
Logical indicating whether to include the total length of rivers per grid cell as a predictor in the model. Default: TRUE. See River_Length for more details.
- NspPerGrid
Integer indicating the minimum number of species per grid cell for a grid cell to be included in the analysis. This is calculated after filtering grid cells by sampling effort (MinEffortsSp) and filtering species by the number of presence grid cells (PresPerSpecies). If NspPerGrid = 0 (default), all grid cells will be used in the models. If NspPerGrid > 0, only grid cells with at least NspPerGrid species present will be considered in the models.
- ExcludeCult
Logical. Indicates whether to exclude countries with cultivated or casual observations per species. Defaults to TRUE.
- ExcludeZeroHabitat
Logical. Indicates whether to exclude grid cells with zero habitat coverage. Defaults to TRUE.
- CV_NFolds
Number of cross-validation folds. Default: 4.
- CV_NGrids
For the CV_Dist cross-validation strategy (see below), this argument determines the size of the blocks (the number of grid cells in both directions).
- CV_NR, CV_NC
Integer. The number of rows and columns used in the CV_Large cross-validation strategy (see below), in which the study area is divided into large blocks according to the provided CV_NR and CV_NC values. Both default to 2, which splits the study area into four large blocks at the median latitude and longitude.
- CV_Plot
Logical indicating whether to plot the block cross-validation folds. Default: TRUE.
- CV_SAC
Logical indicating whether to use spatial autocorrelation to determine the block size. Defaults to FALSE.
- PhyloTree, NoPhyloTree
Logical parameters indicating whether to fit models with (PhyloTree) or without (NoPhyloTree) phylogenetic trees. Defaults are PhyloTree = TRUE and NoPhyloTree = FALSE, meaning only models with phylogenetic trees are fitted by default. At least one of PhyloTree and NoPhyloTree should be TRUE.
- OverwriteRDS
Logical indicating whether to overwrite previously exported RDS files for initial models. Default: TRUE.
- NCores
Integer specifying the number of parallel cores for parallelization. Default: 8 cores.
- NChains
Integer specifying the number of model chains. Default: 4.
- thin
Integer specifying the value(s) for thinning in MCMC sampling. If more than one value is provided, a separate model will be fitted at each value of thinning.
- samples
Integer specifying the value(s) for the number of MCMC samples. If more than one value is provided, a separate model will be fitted at each value of number of samples. Defaults to 1000.
- transientFactor
Integer specifying the transient multiplication factor. The value of transient equals transientFactor multiplied by thin. Default: 500.
- verbose
Integer indicating the interval at which MCMC sampling progress is reported. Default: 200.
- SkipFitted
Logical indicating whether to skip already fitted models. Default: TRUE.
- NumArrayJobs
Integer specifying the maximum number of array jobs per SLURM script. Default: 210. See LUMI documentation for more details.
- ModelCountry
String or vector of strings specifying the country or countries to filter observations by. Default: NULL, which means data is prepared for the whole of Europe.
- VerboseProgress
Logical. Indicates whether progress messages should be displayed. Defaults to TRUE.
- FromHPC
Logical indicating whether the work is being done from HPC, to adjust file paths accordingly. Default: TRUE.
- PrepSLURM
Logical indicating whether to prepare SLURM command files. If TRUE (default), the SLURM commands will be saved to disk using the Mod_SLURM function.
- MemPerCpu
String specifying the memory per CPU for the SLURM job. This value will be assigned to the #SBATCH --mem-per-cpu= SLURM argument. Example: "32G" to request 32 gigabytes. Only effective if PrepSLURM = TRUE.
- Time
String specifying the requested time for each job in the SLURM bash arrays. Example: "01:00:00" to request one hour. Only effective if PrepSLURM = TRUE.
- JobName
String specifying the name of the submitted job(s) for SLURM. If NULL (default), the job name will be prepared based on the folder path and the Hab_Abb value. Only effective if PrepSLURM = TRUE.
- Path_Hmsc
String specifying the path for the Hmsc-HPC. This will be provided as the Path_Hmsc argument of the Mod_SLURM function.
- CheckPython
Logical indicating whether to check if the Python executable exists. Only valid if FromHPC = FALSE.
- ToJSON
Logical indicating whether to convert unfitted models to JSON before saving to RDS file. Default: FALSE.
- Precision
Integer, either 32 (--fp 32) or 64 (--fp 64), setting the precision mode used for sampling while fitting Hmsc-HPC models. Defaults to 64, as shown in the usage above, which is also the default value in Hmsc-HPC. This is still under testing.
- ...
Additional parameters provided to the Mod_SLURM function.
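The relationship between thin, samples, and transientFactor described in the arguments above can be sketched in base R. This is an illustration only, not the function's internal code: the object name mcmc_grid and the example thinning values are hypothetical, and the iterations column assumes the usual Hmsc convention that each chain runs transient + samples * thin iterations.

```r
# Hypothetical example values; the names mirror the documented arguments
thin <- c(5L, 10L)
samples <- 1000L
transientFactor <- 500L

# One row per combination of thin and samples, i.e. one model variant each
mcmc_grid <- expand.grid(thin = thin, samples = samples)

# transient is the product of transientFactor and thin
mcmc_grid$transient <- transientFactor * mcmc_grid$thin

# Assumed Hmsc convention: iterations per chain = transient + samples * thin
mcmc_grid$iterations <- mcmc_grid$transient + mcmc_grid$samples * mcmc_grid$thin

print(mcmc_grid)
```

Under these assumptions, doubling thin doubles both the transient phase and the sampling phase, which is worth keeping in mind when choosing the Time request for SLURM jobs.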
Value
The function is used for its side effects of preparing data and models for HPC and does not return any value.
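As part of the data preparation, the MinEffortsSp, PresPerSpecies, and NspPerGrid arguments are documented as being applied in that order. The toy example below sketches that order on a small presence-absence matrix. Everything in it is hypothetical (the matrix pa, the efforts vector, and the deliberately small thresholds), and it is not the function's actual implementation.

```r
# Toy presence-absence matrix: rows are grid cells, columns are species
pa <- matrix(
  c(1, 1, 0,
    1, 1, 1,
    1, 0, 0,
    0, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("g", 1:4), paste0("sp", 1:3))
)

# Hypothetical sampling effort per grid cell
efforts <- c(g1 = 150, g2 = 50, g3 = 200, g4 = 120)

# Small illustrative thresholds (the documented defaults are 100, 80, and 0)
MinEffortsSp <- 100
PresPerSpecies <- 2
NspPerGrid <- 1

# 1. Drop grid cells with low sampling effort
pa <- pa[efforts >= MinEffortsSp, , drop = FALSE]

# 2. Drop species with too few presence grid cells among the remaining cells
pa <- pa[, colSums(pa) >= PresPerSpecies, drop = FALSE]

# 3. Optionally drop grid cells with too few of the remaining species
if (NspPerGrid > 0) {
  pa <- pa[rowSums(pa) >= NspPerGrid, , drop = FALSE]
}

print(pa)
```

Because each step operates on the output of the previous one, a species can fall below PresPerSpecies only because poorly sampled cells were removed first.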
Details
The function provides options for:
- the habitat types for which the models will be fitted
- excluding grid cells with very low sampling effort (MinEffortsSp)
- selecting species based on a minimum number of presence grid cells (PresPerSpecies)
- optionally fitting models on a specified list of countries (ModelCountry)
- excluding grid cells with few species (NspPerGrid)
- the number of cross-validation folds
- including or excluding phylogenetic information in the model
- different values for the knot distance for GPP (GPP_Dists)
- which bioclimatic variables to use in the models (BioVars)
- whether to include sampling effort (EffortsAsPredictor), the percentage of the respective habitat type per grid cell (HabAsPredictor), and railway and road intensity per grid cell (RoadRailAsPredictor)
- Hmsc options (NChains, thin, samples, transientFactor, and verbose)
- preparing SLURM commands (PrepSLURM) and some specifications (e.g. NumArrayJobs, MemPerCpu, Time, JobName)
The function reads the following environment variables:
- DP_R_Grid (if FromHPC = TRUE) or DP_R_Grid_Local (if FromHPC = FALSE). The function reads the content of the Grid_10_Land_Crop.RData file from this path.
- DP_R_TaxaInfo or DP_R_TaxaInfo_Local for the location of the Species_List_ID.txt file representing species information.
- DP_R_EUBound_sf or DP_R_EUBound_sf_Local for the path of the RData file containing the country boundaries (sf object).
- DP_R_PA or DP_R_PA_Local: the function reads the contents of the Sp_PA_Summary_DF.RData file from this path.
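The way FromHPC switches between an environment variable and its _Local counterpart can be sketched as follows. This is a minimal illustration under stated assumptions: pick_env_var is a hypothetical helper that does not exist in the package, the paths are placeholders, and the variables are set with Sys.setenv here rather than loaded from the file given in EnvFile.

```r
# Hypothetical helper: choose the variable name depending on FromHPC
pick_env_var <- function(base_name, FromHPC = TRUE) {
  if (FromHPC) base_name else paste0(base_name, "_Local")
}

# Placeholder paths for demonstration only
Sys.setenv(
  DP_R_Grid = "/hpc/placeholder/grid",
  DP_R_Grid_Local = "/local/placeholder/grid"
)

# Resolve the grid directory when working outside the HPC
var_name <- pick_env_var("DP_R_Grid", FromHPC = FALSE)
grid_dir <- Sys.getenv(var_name, unset = NA)
if (is.na(grid_dir)) {
  stop("Environment variable not set: ", var_name)
}

# File name taken from the list above
grid_file <- file.path(grid_dir, "Grid_10_Land_Crop.RData")
print(grid_file)
```

Failing early when a variable is missing, as in this sketch, tends to give a clearer error than a later failed file read.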