Model pipeline for post-processing fitted Hmsc models
Source: R/mod_postprocess.R, R/mod_postprocess_CV.R
Mod_postprocessing.Rd
These functions post-process fitted Hmsc models on both CPU and GPU. The main
functions in the pipeline include mod_postprocess_1_CPU, mod_prepare_TF, and
mod_postprocess_2_CPU for full models without cross-validation, as well as
mod_postprocess_CV_1_CPU and mod_postprocess_CV_2_CPU for cross-validated
models. See details for more information.
Usage
mod_postprocess_1_CPU(
model_dir = NULL,
hab_abb = NULL,
n_cores = 8L,
strategy = "multisession",
env_file = ".env",
path_Hmsc = NULL,
memory_per_cpu = "64G",
job_runtime = NULL,
from_JSON = FALSE,
GPP_dist = NULL,
use_trees = "Tree",
MCMC_n_samples = 1000L,
MCMC_thin = NULL,
n_omega = 1000L,
CV_name = c("CV_Dist", "CV_Large"),
n_grid = 50L,
use_TF = TRUE,
TF_use_single = FALSE,
LF_n_cores = n_cores,
LF_temp_cleanup = TRUE,
LF_check = FALSE,
temp_cleanup = TRUE,
TF_environ = NULL,
clamp_pred = TRUE,
fix_efforts = "q90",
fix_rivers = "q90",
pred_new_sites = TRUE,
n_cores_VP = 10L,
width_omega = 26,
height_omega = 22.5,
width_beta = 25,
height_beta = 35
)
mod_prepare_TF(
process_VP = TRUE,
process_LF = TRUE,
n_batch_files = 210L,
env_file = ".env",
working_directory = NULL,
partition_name = "small-g",
LF_runtime = "01:00:00",
model_prefix = NULL,
VP_runtime = "02:00:00"
)
mod_postprocess_2_CPU(
model_dir = NULL,
hab_abb = NULL,
n_cores = 8L,
strategy = "multisession",
env_file = ".env",
GPP_dist = NULL,
use_trees = "Tree",
MCMC_n_samples = 1000L,
MCMC_thin = NULL,
use_TF = TRUE,
TF_environ = NULL,
TF_use_single = FALSE,
LF_n_cores = n_cores,
LF_check = FALSE,
LF_temp_cleanup = TRUE,
temp_cleanup = TRUE,
n_grid = 50L,
CC_models = c("GFDL-ESM4", "IPSL-CM6A-LR", "MPI-ESM1-2-HR", "MRI-ESM2-0",
"UKESM1-0-LL"),
CC_scenario = c("ssp126", "ssp370", "ssp585"),
RC_n_cores = 8L,
clamp_pred = TRUE,
fix_efforts = "q90",
fix_rivers = "q90",
pred_new_sites = TRUE,
tar_predictions = TRUE,
RC_prepare = TRUE,
RC_plot = TRUE,
VP_prepare = TRUE,
VP_plot = TRUE,
predict_suitability = TRUE,
plot_predictions = TRUE,
plot_LF = TRUE,
plot_internal_evaluation = TRUE
)
mod_postprocess_CV_1_CPU(
model_dir = NULL,
CV_names = NULL,
n_cores = 8L,
strategy = "multisession",
env_file = ".env",
from_JSON = FALSE,
use_TF = TRUE,
TF_use_single = FALSE,
TF_environ = NULL,
LF_n_cores = n_cores,
LF_only = TRUE,
LF_temp_cleanup = TRUE,
LF_check = FALSE,
LF_runtime = "01:00:00",
temp_cleanup = TRUE,
n_batch_files = 210L,
working_directory = NULL,
partition_name = "small-g"
)
mod_postprocess_CV_2_CPU(
model_dir = NULL,
CV_names = NULL,
n_cores = 8L,
strategy = "multisession",
env_file = ".env",
use_TF = TRUE,
TF_use_single = FALSE,
temp_cleanup = TRUE,
LF_temp_cleanup = TRUE,
TF_environ = NULL,
LF_n_cores = n_cores,
LF_check = FALSE
)
Arguments
- model_dir
Character. Path to the root directory of the fitted model.
- hab_abb
Character. Habitat abbreviation indicating the specific SynHab habitat type. Valid values: 0, 1, 2, 3, 4a, 4b, 10, 12a, 12b. See Pysek et al. for details.
- n_cores
Integer. Number of CPU cores to use for parallel processing. Default: 8.
- strategy
Character. The parallel processing strategy to use. Valid options are "sequential", "multisession" (default), "multicore", and "cluster". See future::plan() and ecokit::set_parallel() for details.
- env_file
Character. Path to the environment file containing paths to data sources. Defaults to .env.
- path_Hmsc
Character. Path to the Hmsc-HPC installation.
- memory_per_cpu
Character. Memory allocation per CPU core. Example: "32G" for 32 gigabytes. Defaults to "64G".
- job_runtime
Character. Maximum allowed runtime for the job. Example: "01:00:00" for one hour. Required — if not provided, the function throws an error.
- from_JSON
Logical. Whether to convert loaded models from JSON format before reading. Defaults to FALSE.
- GPP_dist
Integer. Distance in kilometres between knots for the selected model.
- use_trees
Character. Whether a phylogenetic tree was used in the selected model. Accepts "Tree" (default) or "NoTree".
- MCMC_thin, MCMC_n_samples
Integer. Thinning value and the number of MCMC samples of the selected model.
- n_omega
Integer. The number of species to be sampled for the Omega parameter transformation. Defaults to 1000.
- CV_name
Character. Cross-validation strategy. Valid values are CV_Dist, CV_Large, or CV_SAC. Defaults to c("CV_Dist", "CV_Large").
- n_grid
Integer. Number of points along the gradient for continuous focal variables. Higher values result in smoother curves. Default: 50. See Hmsc::constructGradient for details.
- use_TF
Logical. Whether to use TensorFlow for calculations. Defaults to TRUE.
- TF_use_single
Logical. Whether to use single precision for the TensorFlow calculations. Defaults to FALSE.
- LF_n_cores
Integer. Number of cores to use for parallel processing of latent factor prediction. Defaults to 8L.
- LF_temp_cleanup
Logical. Whether to delete temporary files in the temp_dir directory after finishing the LF predictions.
- LF_check
Logical. If TRUE, the function checks if the output files are already created and valid. If FALSE, the function will only check if the files exist without checking their integrity. Default is FALSE.
- temp_cleanup
Logical. Whether to clean up temporary files. Defaults to TRUE.
- TF_environ
Character. Path to the Python environment. This argument is required if use_TF is TRUE under Windows. Defaults to NULL.
- clamp_pred
Logical indicating whether to clamp the sampling efforts at a single value. If TRUE (default), the fix_efforts argument must be provided.
- fix_efforts
Numeric or character. When clamp_pred = TRUE, fixes the sampling efforts predictor at this value during predictions. If numeric, uses the value directly (on log10 scale). If character, must be one of median, mean, max, or q90 (90% quantile). Using max may reflect extreme sampling efforts from highly sampled locations, while q90 captures high sampling areas without extremes. Required if clamp_pred = TRUE.
- fix_rivers
Numeric, character, or NULL. Similar to fix_efforts, but for the river length predictor. If NULL, the river length is not fixed. Default: q90.
- pred_new_sites
Logical. Whether to predict suitability at new sites. Default: TRUE.
- n_cores_VP
Integer. Number of cores to use for processing variance partitioning. Defaults to 10L.
- width_omega, height_omega, width_beta, height_beta
Integer. The width and height of the generated heatmaps of the Omega and Beta parameters in centimetres.
- process_VP
Logical. Whether to prepare batch scripts for variance partitioning computations on GPUs. Defaults to TRUE.
- process_LF
Logical. Whether to prepare batch scripts for latent factor prediction computations on GPUs. Defaults to TRUE.
- n_batch_files
Integer. Number of output batch files to create. Must be less than or equal to the maximum job limit of the HPC environment.
- working_directory
Character. Optionally sets the working directory in batch scripts to this path. If NULL, the directory remains unchanged.
- partition_name
Character. Name of the partition to submit the SLURM jobs to. Default is small-g.
- LF_runtime, VP_runtime
Character. Time limits for latent factor prediction and variance partitioning processing jobs, respectively. Defaults are 01:00:00 and 02:00:00.
- model_prefix
Character. Prefix for the model name. A directory named model_prefix_TF is created in the model_dir to store the TensorFlow running commands. Defaults to NULL, but a value must be provided; it can not be NULL.
- CC_models
Character vector. Climate models for future predictions. Available options are c("GFDL-ESM4", "IPSL-CM6A-LR", "MPI-ESM1-2-HR", "MRI-ESM2-0", "UKESM1-0-LL") (default).
- CC_scenario
Character vector. Climate scenarios for future predictions. Available options are c("ssp126", "ssp370", "ssp585") (default).
- RC_n_cores
Integer. The number of cores to use for response curve prediction. Defaults to 8.
- tar_predictions
Logical. Whether to bundle the prediction files into a single *.tar file (without compression). Default: TRUE.
- RC_prepare
Logical. Whether to prepare the data for response curve prediction (using resp_curv_prepare_data). Defaults to TRUE.
- RC_plot
Logical. Whether to plot the response curves as JPEG files (using resp_curv_plot_SR, resp_curv_plot_species, and resp_curv_plot_species_all). Defaults to TRUE.
- VP_prepare
Logical. Whether to prepare the data for variance partitioning (using variance_partitioning_compute). Defaults to TRUE.
- VP_plot
Logical. Whether to plot the variance partitioning results (using variance_partitioning_plot). Defaults to TRUE.
- predict_suitability
Logical. Whether to predict habitat suitability across different climate options (using predict_maps). Defaults to TRUE.
- plot_predictions
Logical. Whether to plot species and species richness predictions as JPEG files (using plot_prediction). Defaults to TRUE.
- plot_LF
Logical. Whether to plot latent factors as JPEG files (using plot_latent_factor). Defaults to TRUE.
- plot_internal_evaluation
Logical. Whether to compute and visualise model internal evaluation (explanatory power) using plot_evaluation. Defaults to TRUE.
- CV_names
Character vector. Names of cross-validation strategies to merge, matching those used during model setup. Defaults to c("CV_Dist", "CV_Large"). The names should be one of CV_Dist, CV_Large, or CV_SAC. Applies only to mod_merge_chains_CV.
- LF_only
Logical. Whether to predict only the latent factor. This is useful for distributing processing load between GPU and CPU. When LF_only = TRUE, latent factor prediction needs to be computed separately on GPU. When the GPU computations are finished, the function can later be rerun with LF_only = FALSE to predict habitat suitability using the already-computed latent factor predictions.
Details
mod_postprocess_1_CPU
This function performs the initial post-processing step for habitat-specific fitted models, automating the following tasks:
- check unsuccessful models: mod_SLURM_refit
- merge chains and save R objects (fitted model object and coda object) to qs2 or RData files: mod_merge_chains
- visualise the convergence of all fitted model variants: convergence_plot_all
- visualise the convergence of the selected model, including Gelman-Rubin-Brooks plots (plot_gelman) and convergence_plot for convergence diagnostics of the rho, alpha, omega, and beta parameters
- extract and save the model summary: mod_summary
- plot model parameters: mod_heatmap_omega, mod_heatmap_beta
- prepare data for cross-validation and fit initial cross-validated models: mod_CV_fit
- prepare scripts for GPU processing, including: predicting latent factors of the response curves (resp_curv_prepare_data), predicting latent factors for new sampling units (predict_maps), and computing variance partitioning (variance_partitioning_compute)
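The sketch below illustrates one possible call for a single habitat type; the model directory, Hmsc-HPC path, SLURM runtime, and model settings (GPP_dist, MCMC_thin) are placeholders and must match the actual fitted model.
# Sketch only: initial CPU post-processing for a single habitat type.
# All paths and model settings below are placeholders.
mod_postprocess_1_CPU(
  model_dir = "models/hab_4a",      # placeholder model root directory
  hab_abb = "4a",                   # SynHab habitat abbreviation
  n_cores = 8L,
  path_Hmsc = "~/Hmsc-HPC",         # placeholder Hmsc-HPC installation path
  job_runtime = "04:00:00",         # required SLURM runtime
  GPP_dist = 100L,                  # knot distance (km) used when fitting
  MCMC_thin = 100L,                 # thinning used when fitting
  MCMC_n_samples = 1000L
)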
mod_prepare_TF
After running mod_postprocess_1_CPU for all habitat types, this function prepares batch scripts for GPU computations covering all habitat types:
- for variance partitioning, the function matches all files with the pattern "VP_.+Command.txt" (created by variance_partitioning_compute) and merges their contents into a single file (model_prefix_TF/VP_Commands.txt). It then prepares a SLURM script for variance partitioning computations (model_prefix_TF/VP_SLURM.slurm).
- for latent factor predictions, the function matches all files with the pattern "^LF_NewSites_Commands_.+.txt|^LF_RC_Commands_.+txt" and splits their contents into multiple scripts in the model_prefix_TF directory for processing as a batch job. The function prepares a SLURM script for latent factor predictions (LF_SLURM.slurm).
This function is tailored for the LUMI HPC environment and assumes that the tensorflow module is installed and correctly configured with all required Python packages. On other HPC systems, users may need to modify the function to load a Python virtual environment or install the required dependencies for TensorFlow and related packages.
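A hypothetical call, assuming all habitat-specific models share one working directory on LUMI; the prefix, path, and partition name are placeholders.
# Sketch only: prepare GPU batch scripts once all habitat types are
# post-processed; model_prefix and working_directory are placeholders.
mod_prepare_TF(
  model_prefix = "Mod_Hab",                        # a Mod_Hab_TF directory is created
  working_directory = "/scratch/project_xyz/models",
  n_batch_files = 210L,                            # keep within the HPC job limit
  partition_name = "small-g",
  LF_runtime = "01:00:00",
  VP_runtime = "02:00:00"
)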
mod_postprocess_2_CPU
This function continues the post-processing pipeline for fitted Hmsc models by automating the following steps:
- process and visualise response curves: response_curves
- predict habitat suitability across different climate options: predict_maps
- plot species and species richness predictions as JPEG files: plot_prediction
- plot latent factors as JPEG files: plot_latent_factor
- process and visualise variance partitioning: variance_partitioning_compute and variance_partitioning_plot
- compute and visualise model internal evaluation (explanatory power): plot_evaluation
- initiate post-processing of fitted cross-validated models: prepare commands for latent factor predictions on GPU (ongoing)
This function should be run after:
- completing mod_postprocess_1_CPU and mod_prepare_TF on CPU,
- running VP_SLURM.slurm and LF_SLURM.slurm on GPU to compute variance partitioning and latent factor predictions (both scripts are generated by mod_prepare_TF),
- submitting SLURM jobs for cross-validated model fitting.
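For illustration, a follow-up call for the same habitat type might look as follows; paths and model settings are placeholders, and the remaining arguments keep their defaults.
# Sketch only: continue CPU post-processing after the GPU jobs have finished.
mod_postprocess_2_CPU(
  model_dir = "models/hab_4a",      # placeholder model root directory
  hab_abb = "4a",
  n_cores = 8L,
  GPP_dist = 100L,                  # must match the fitted model
  MCMC_thin = 100L,
  MCMC_n_samples = 1000L,
  CC_scenario = c("ssp126", "ssp370", "ssp585")
)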
mod_postprocess_CV_1_CPU
This function is similar to mod_postprocess_1_CPU, but it is specifically designed for cross-validated models. It automates merging fitted cross-validated model chains into Hmsc model objects and prepares scripts for latent factor prediction with TensorFlow using predict_maps_CV.
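A sketch of the first cross-validation step (placeholder directory); with LF_only = TRUE, the latent factors are then computed separately on GPU before continuing with mod_postprocess_CV_2_CPU.
# Sketch only: merge cross-validated chains and prepare LF prediction scripts.
mod_postprocess_CV_1_CPU(
  model_dir = "models/hab_4a",            # placeholder model root directory
  CV_names = c("CV_Dist", "CV_Large"),
  n_cores = 8L,
  LF_only = TRUE                          # latent factors computed later on GPU
)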
mod_postprocess_CV_2_CPU
The function 1) processes the *.feather files resulting from latent factor predictions (using TensorFlow) and saves the LF predictions to disk; 2) predicts species-specific mean habitat suitability at the testing cross-validation folds and calculates testing evaluation metrics; 3) generates plots of the evaluation metrics.
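For example (placeholder directory; run only after the GPU latent factor jobs have finished):
# Sketch only: evaluate cross-validated predictions once GPU LF jobs are done.
mod_postprocess_CV_2_CPU(
  model_dir = "models/hab_4a",            # placeholder model root directory
  CV_names = c("CV_Dist", "CV_Large"),
  n_cores = 8L
)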