This article outlines the preparation of input data for model fitting and the subsequent process of fitting these models on GPUs within the IAS-pDT workflow.



Model input data

The primary function for preparing model-fitting data and initializing models is Mod_Prep4HPC(). It structures data for each habitat-specific model into distinct directories (e.g., datasets/processed/model_fitting/HabX, where X represents a habitat type). This function orchestrates a suite of specialized sub-functions to perform the following tasks:


  • Mod_PrepData(): prepare input data for modeling, with key arguments including:
      • Hab_Abb: abbreviation of a single habitat type to be modeled
      • Path_Model: directory path for storing all model files
      • MinEffortsSp: minimum number of vascular plant species per grid cell required for inclusion in model fitting. This reflects the total count of vascular plant species (including native species) recorded in GBIF across Europe, as computed during the sampling effort preparation step (Efforts_Process()). This argument filters out grid cells with insufficient sampling effort
      • ExcludeCult: whether to exclude countries with cultivated or casual observations for each species
      • ExcludeZeroHabitat: whether to exclude grid cells with zero habitat coverage of the respective habitat type
      • PresPerSpecies: minimum number of presence grid cells required for a species to be included in the models, calculated after excluding grid cells with low sampling effort (MinEffortsSp), zero habitat coverage (ExcludeZeroHabitat), and countries with cultivated or casual observations (ExcludeCult)



  • Mod_GetCV(): prepare and visualize options for spatial-block cross-validation. In the CV_Dist strategy, block size is governed by the CV_NGrids argument, whereas in the CV_Large strategy, the study area is partitioned into larger blocks based on the CV_NR and CV_NC arguments.
      • CV_NFolds: number of cross-validation folds
      • CV_NGrids: number of grid cells in each direction for the CV_Dist cross-validation strategy (default: 20, yielding blocks of 20 × 20 grid cells)
      • CV_NR / CV_NC: number of rows and columns used in the CV_Large cross-validation strategy to partition the study area into large blocks (default: CV_NR = CV_NC = 2, resulting in four blocks divided at median coordinates)



  • Mod_PrepKnots(): prepare and visualize knot locations for Gaussian Predictive Process (GPP) models, as described by Tikhonov et al. (2019).
      • GPP: whether to incorporate spatial random effects using the Gaussian Predictive Process (GPP)
      • GPP_Dists: distance (in kilometers) specifying both the spacing between knots and the minimum distance between a knot and the nearest sampling point
      • GPP_Plot: whether to plot the coordinates of sampling units and knots
      • MinLF / MaxLF: minimum and maximum number of latent factors to be included
      • Alphapw: prior specification for the alpha parameter



  • Mod_SLURM(): generate SLURM scripts to facilitate model fitting on GPUs using the Hmsc-HPC extension.
      • JobName: name assigned to the SLURM job
      • ntasks: number of tasks to execute
      • CpusPerTask / GpusPerNode: number of CPUs allocated per task and number of GPUs allocated per node
      • MemPerCpu: memory allocation per CPU
      • Time: maximum duration for job execution
      • Partition: name of the HPC partition
      • NumArrayJobs: number of jobs within each SLURM script
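
The data-filtering order described for Mod_PrepData() and the default CV_Large split described for Mod_GetCV() can be illustrated with a small, standalone Python sketch. It does not call the package: the toy grid cells, field names, and helper functions are hypothetical, the thresholds are treated as inclusive (an assumption), and the ExcludeCult step is omitted for brevity.

```python
from statistics import median

# Toy grid cells; all field names and values are hypothetical.
cells = [
    {"id": 1, "x": 0.0, "y": 0.0, "effort_sp": 150, "hab_pct": 12.0, "species": {"A", "B"}},
    {"id": 2, "x": 1.0, "y": 0.0, "effort_sp": 40, "hab_pct": 30.0, "species": {"A"}},
    {"id": 3, "x": 0.0, "y": 1.0, "effort_sp": 300, "hab_pct": 0.0, "species": {"B"}},
    {"id": 4, "x": 3.0, "y": 4.0, "effort_sp": 220, "hab_pct": 5.0, "species": {"A"}},
    {"id": 5, "x": 4.0, "y": 3.0, "effort_sp": 180, "hab_pct": 8.0, "species": {"A", "C"}},
]


def filter_cells(cells, min_efforts_sp, exclude_zero_habitat):
    """Drop cells with low sampling effort, then (optionally) zero habitat cover."""
    kept = [c for c in cells if c["effort_sp"] >= min_efforts_sp]
    if exclude_zero_habitat:
        kept = [c for c in kept if c["hab_pct"] > 0]
    return kept


def filter_species(cells, pres_per_species):
    """Keep species with enough presence cells, counted AFTER cell filtering."""
    counts = {}
    for cell in cells:
        for sp in cell["species"]:
            counts[sp] = counts.get(sp, 0) + 1
    return sorted(sp for sp, n in counts.items() if n >= pres_per_species)


def cv_large_folds(cells):
    """Assign each cell to one of four blocks split at the median x and y."""
    mx = median(c["x"] for c in cells)
    my = median(c["y"] for c in cells)
    return {c["id"]: 1 + int(c["x"] > mx) + 2 * int(c["y"] > my) for c in cells}


kept = filter_cells(cells, min_efforts_sp=100, exclude_zero_habitat=True)
species = filter_species(kept, pres_per_species=2)
folds = cv_large_folds(kept)

print([c["id"] for c in kept])  # cells that survive both filters
print(species)                  # species with enough presences
print(folds)                    # block label per retained cell
```

Counting presences only after the cell filters is the point of the sketch: a species that is common in poorly sampled or zero-habitat cells can still fall below PresPerSpecies once those cells are removed.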



Other arguments:

  • selection of predictors:
      • BioVars: names of CHELSA variables to include in the model
      • QuadraticVars: names of variables for which quadratic terms are incorporated
      • EffortsAsPredictor: whether to include the (log10-transformed) sampling effort as a predictor
      • RoadRailAsPredictor: whether to include the (log10-transformed) summed road and railway intensity as a predictor
      • HabAsPredictor: whether to include the (log10-transformed) percentage coverage of the respective habitat type per grid cell as a predictor
      • RiversAsPredictor: whether to include the (log10-transformed) total river length per grid cell as a predictor



  • model fitting options:
      • NChains: number of MCMC chains
      • thin: thinning value(s) in MCMC sampling
      • samples: number of MCMC samples per chain
      • transientFactor: multiplication factor for the transient; the value of transient is computed as transientFactor × thin
      • verbose: interval at which MCMC sampling progress is reported
      • Precision: floating-point precision mode for Hmsc-HPC sampling



  • NspPerGrid: minimum number of IAS per grid cell for a grid cell to be included in the analysis
  • ModelCountry: fit the model for a specific country or countries
  • PhyloTree / NoPhyloTree: whether or not to use phylogenetic trees
  • Path_Hmsc: directory path to Hmsc-HPC extension installation
  • PrepSLURM: whether to prepare SLURM script for model fitting on GPU via Mod_SLURM()
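
The relationship between thin, samples, and transientFactor can be worked through numerically. The following Python sketch is illustrative only: the helper name and the argument values are hypothetical, and the per-chain iteration count assumes the usual Hmsc convention that a chain runs transient + samples × thin iterations.

```python
def mcmc_schedule(samples, thin_values, transient_factor, n_chains):
    """Derive transient and iteration counts for each thinning value."""
    rows = []
    for thin in thin_values:
        # transient is the product of transientFactor and thin
        transient = transient_factor * thin
        # Assumed Hmsc convention: transient phase plus thinned sampling phase
        iterations = transient + samples * thin
        rows.append(
            {
                "thin": thin,
                "transient": transient,
                "iterations_per_chain": iterations,
                "iterations_all_chains": iterations * n_chains,
            }
        )
    return rows


# Hypothetical settings: two thinning values, i.e. two model variants.
schedule = mcmc_schedule(
    samples=1000, thin_values=[5, 10], transient_factor=500, n_chains=4
)
for row in schedule:
    print(row)
```

Because transient scales with thin, doubling the thinning value doubles both the transient and the sampling phase, which is worth keeping in mind when choosing the Time limit for the SLURM job.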

Model fitting on GPUs

Once the model input data are prepared and the models initialized, the next step is to fit the models on GPUs. For each habitat type, the Mod_Prep4HPC() function produces:

  • Python commands (Commands2Fit.txt) for fitting model chains across all model variants on GPUs, with each line corresponding to a single chain.


  • one or more SLURM script files (Bash_Fit.slurm) designed to submit all model-fitting commands (Commands2Fit.txt) as batch jobs on a high-performance computing (HPC) system.




Batch jobs for model fitting can be submitted using the sbatch command, for example:

sbatch datasets/processed/model_fitting/Hab1/Bash_Fit.slurm
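
When a habitat directory contains more than one SLURM script, submission can be scripted. The Python sketch below is a hedged illustration, not part of the workflow: the helper names are hypothetical, and the glob pattern Bash_Fit*.slurm is an assumption about how additional script files are named. It defaults to a dry run that only prints the sbatch commands.

```python
import subprocess
from pathlib import Path


def count_chains(commands_file):
    """Count non-empty lines; each line of Commands2Fit.txt is one chain."""
    with open(commands_file) as fh:
        return sum(1 for line in fh if line.strip())


def build_sbatch_commands(habitat_dir, pattern="Bash_Fit*.slurm"):
    """List one sbatch command per SLURM script (file pattern is assumed)."""
    scripts = sorted(Path(habitat_dir).glob(pattern))
    return [["sbatch", str(script)] for script in scripts]


def submit(habitat_dir, dry_run=True):
    """Print the commands (dry run) or hand them to sbatch."""
    commands = build_sbatch_commands(habitat_dir)
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # Requires sbatch on PATH, i.e. running on the HPC login node
            subprocess.run(cmd, check=True)
    return commands


# Dry run for one habitat; prints nothing if the directory has no scripts.
submit("datasets/processed/model_fitting/Hab1")
```

Calling count_chains() on Commands2Fit.txt before submission gives a quick check that the number of chains to be fitted matches the expected number of model variants times NChains.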

Previous articles:
1. Overview
2. Processing abiotic data
3. Processing biotic data
Next articles:
5. Model post-processing