
This article outlines the preparation of input data for model fitting and the subsequent process of fitting these models on GPUs within the IASDT workflow.



Model input data

The primary function for preparing model-fitting data and initialising models is mod_prepare_HPC(). It structures data for each habitat-specific model into distinct directories (e.g., datasets/processed/model_fitting/HabX, where X represents a habitat type), orchestrating a suite of specialised sub-functions described below. Its key arguments include:


hab_abb abbreviation of a single habitat type to be modelled
directory_name directory path for storing all model files
min_efforts_n_species minimum number of vascular plant species per grid cell required for inclusion in model fitting. This reflects the total count of vascular plant species (including native species) recorded in GBIF across Europe, as computed during the sampling effort preparation step (efforts_process()). This argument filters out grid cells with insufficient sampling effort
exclude_cultivated whether to exclude countries with cultivated or casual observations for each species
exclude_0_habitat whether to exclude grid cells with zero habitat coverage of the respective habitat type
n_pres_per_species minimum number of presence grid cells required for a species to be included in the models, calculated after excluding grid cells with low sampling effort (min_efforts_n_species), zero habitat coverage (exclude_0_habitat), and countries with cultivated or casual observations (exclude_cultivated)
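Taken together, a call might look like the following sketch. All argument values here are illustrative only, not package defaults:

```r
# Illustrative call to mod_prepare_HPC(); the values shown are
# examples, not the package defaults.
mod_prepare_HPC(
  hab_abb = "1",                               # habitat type to model
  directory_name = "datasets/processed/model_fitting",
  min_efforts_n_species = 100,                 # drop poorly sampled grid cells
  exclude_cultivated = TRUE,                   # drop cultivated/casual country records
  exclude_0_habitat = TRUE,                    # drop cells with 0% habitat coverage
  n_pres_per_species = 80                      # minimum presence cells per species
)
```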



  • mod_CV_prepare(): prepare and visualise options for spatial-block cross-validation. In the CV_Dist strategy, block size is governed by the CV_n_grids argument, whereas in the CV_Large strategy, the study area is partitioned into larger blocks based on the CV_n_rows and CV_n_columns arguments.
CV_n_folds number of cross-validation folds
CV_n_grids number of grid cells in each direction for the CV_Dist cross-validation strategy (default: 20, yielding 20 × 20 grid-cell blocks).
CV_n_rows / CV_n_columns numbers of rows and columns for the CV_Large cross-validation strategy, partitioning the study area into large blocks (default: CV_n_rows = CV_n_columns = 2, resulting in four blocks divided at the median coordinates).
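The block assignment under the CV_Dist strategy can be sketched in a few lines of base R. This is an independent illustration of the idea (cells grouped into CV_n_grids × CV_n_grids blocks, whole blocks assigned to folds), not the package's implementation:

```r
# Assign grid cells to spatial blocks, then blocks to CV folds.
CV_n_grids <- 20
CV_n_folds <- 4

# hypothetical grid-cell indices for a 100 x 100 study area
cells <- expand.grid(row = 1:100, col = 1:100)

# each block covers CV_n_grids x CV_n_grids cells
block_id <- interaction(ceiling(cells$row / CV_n_grids),
                        ceiling(cells$col / CV_n_grids),
                        drop = TRUE)

# assign whole blocks (not individual cells) to folds
set.seed(1)
fold_of_block <- sample(rep_len(1:CV_n_folds, nlevels(block_id)))
cells$fold <- fold_of_block[as.integer(block_id)]
```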



  • prepare_knots(): prepare and visualise knot locations for Gaussian Predictive Process (GPP) models, as described by Tikhonov et al. (2019).
GPP whether to incorporate spatial random effects using the Gaussian Predictive Process (GPP)
GPP_dists distance (in kilometres; controlled by the min_distance argument of prepare_knots()) specifying both the spacing between knots and the minimum distance between a knot and the nearest sampling point
GPP_plot whether to plot the coordinates of sampling units and knots
min_LF / max_LF minimum and maximum number of latent factors to be included
alphapw prior specification for the alpha parameter
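The knot-placement idea can be sketched in base R: knots on a regular grid spaced GPP_dists apart, retaining only knots with a sampling unit nearby. This is one common convention for GPP knot grids; prepare_knots() may apply the distance rule differently, and the coordinates below are hypothetical:

```r
# Sketch of GPP knot placement: regular grid with spacing GPP_dists,
# keeping only knots that have a sampling unit within GPP_dists.
set.seed(1)
GPP_dists <- 100  # km

# hypothetical sampling-unit coordinates (km)
samples <- data.frame(x = runif(500, 0, 1000), y = runif(500, 0, 1000))

# candidate knots on a regular grid
knots <- expand.grid(x = seq(0, 1000, by = GPP_dists),
                     y = seq(0, 1000, by = GPP_dists))

# distance from each knot to its nearest sampling unit
nearest <- apply(knots, 1, function(k)
  min(sqrt((samples$x - k[1])^2 + (samples$y - k[2])^2)))

# discard knots with no sampling unit nearby
knots <- knots[nearest <= GPP_dists, ]
```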



  • mod_SLURM(): generate SLURM scripts to facilitate model fitting on GPUs using the Hmsc-HPC extension.
job_name name assigned to the SLURM job
ntasks number of tasks to execute
cpus_per_task / gpus_per_node number of CPUs and GPUs allocated per node
memory_per_cpu memory allocation per CPU
job_runtime maximum duration for job execution
HPC_partition name of the HPC partition
n_array_jobs number of jobs within each SLURM script
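A call to mod_SLURM() might look like the following sketch. All values, including the formats of the memory and runtime strings, are assumptions for illustration, not package defaults:

```r
# Illustrative call to mod_SLURM(); values are examples only.
mod_SLURM(
  job_name = "hab1_fit",       # SLURM job name
  ntasks = 1,                  # tasks per job
  cpus_per_task = 4,           # CPUs per task
  gpus_per_node = 1,           # GPUs per node
  memory_per_cpu = "16G",      # memory per CPU
  job_runtime = "24:00:00",    # wall-time limit
  HPC_partition = "gpu",       # partition to submit to
  n_array_jobs = 210           # jobs per SLURM script
)
```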



Other arguments:

  • selection of predictors:
bio_variables names of CHELSA variables to include in the model
quadratic_variables names of variables for which quadratic terms are incorporated
efforts_as_predictor whether to include the (log10-transformed) sampling effort as a predictor
road_rail_as_predictor whether to include the (log10-transformed) summed road and railway intensity as a predictor
habitat_as_predictor whether to include the (log10-transformed) percentage coverage of the respective habitat type per grid cell as a predictor
river_as_predictor whether to include the (log10-transformed) total river length per grid cell as a predictor



  • model fitting options
MCMC_n_chains number of MCMC chains
MCMC_thin thinning value(s) in MCMC sampling
MCMC_samples number of MCMC samples per chain
MCMC_transient_factor transient multiplication factor; the length of the transient (burn-in) equals the product of MCMC_transient_factor and MCMC_thin
MCMC_verbose interval at which MCMC sampling progress is reported
precision floating-point precision mode for Hmsc-HPC sampling



  • n_species_per_grid: minimum number of invasive alien species (IAS) per grid cell for a grid cell to be included in the analysis
  • model_country: fit the model for a specific country or countries
  • use_phylo_tree / no_phylo_tree: whether or not to use phylogenetic trees
  • path_Hmsc: directory path to Hmsc-HPC extension installation
  • SLURM_prepare: whether to prepare SLURM script for model fitting on GPU via mod_SLURM()

Model fitting on GPUs

Following the preparation of model input data and initialisation of models, the subsequent phase involves fitting these models on GPUs. For each habitat type, the mod_prepare_HPC() function produces:

  • Python commands (Commands2Fit.txt) for fitting model chains across all model variants on GPUs, with each line corresponding to a single chain.


  • one or more SLURM script files (Bash_Fit.slurm) designed to submit all model-fitting commands (Commands2Fit.txt) as batch jobs on a high-performance computing (HPC) system.




Batch jobs for model fitting can be submitted using the sbatch command, for example:

sbatch datasets/processed/model_fitting/Hab1/Bash_Fit.slurm

Previous articles:
1. Overview
2. Processing abiotic data
3. Processing biotic data
Next articles:
5. Model post-processing